Abstract


An α-approximation algorithm is an algorithm guaranteed to output a solution
that is within a factor α of the optimal solution. We are interested in the
following question: Given an NP-hard optimization problem, what is the best
approximation guarantee that any polynomial time algorithm could achieve?

We mostly focus on studying the approximability of two classes of NP-hard
problems: Constraint Satisfaction Problems (CSPs) and Computational Learning
Problems.

For CSPs, we mainly study the approximability of Max Cut, Max 3-CSP,
Max 2-Lin_Z, Vertex-Pricing, as well as several variants of Unique-Games.

• The problem of Max Cut is to find a partition of the vertices of a graph into
two parts so as to maximize the number of edges between the two parts. Assuming
the Unique Games Conjecture, we give a complete characterization of the
approximation curve of the Max Cut problem: for every optimum value of the
instance, we show that a certain SDP algorithm with RPR^2 rounding always
achieves the optimal approximation curve.

• The input to a 3-CSP is a set of Boolean constraints such that each constraint
contains at most 3 Boolean variables. The goal is to find an assignment to these
variables that maximizes the number of satisfied constraints. We are interested
in the case when the 3-CSP is satisfiable, i.e., there exists an assignment that
satisfies every constraint. Assuming the d-to-1 Conjecture (a variant of the
Unique Games Conjecture), we prove that it is NP-hard to give a better than
5/8-approximation for the problem. This result matches an SDP algorithm by Zwick
that gives a 5/8-approximation for satisfiable 3-CSP. In addition, our result
conditionally resolves a fundamental open problem in PCP theory on the optimal
soundness of a 3-query nonadaptive PCP system for NP with perfect completeness.

• The problem of Max 2-Lin_Z involves a system of linear equations over the
integers in which each equation contains at most 2 variables. The goal is to
find an assignment to the variables that maximizes the total number of satisfied
equations. It is a natural generalization of the problem addressed by the Unique
Games Conjecture, which concerns the hardness of the same equation systems over
finite fields. We show that, assuming the Unique Games Conjecture, for a
Max 2-Lin_Z instance, even if there exists a solution that satisfies a 1 - ε
fraction of the equations, it is NP-hard to find one that satisfies an ε
fraction of the equations, for any ε > 0.


• The problem of Vertex-Pricing involves a set of customers, each of whom is
interested in buying a set of items. Vertex-Pricing_k is the special case when
each customer is interested in at most k of the items. All of the buyers are
single minded, which means that each buyer buys all the items they are
interested in if the total cost of the items is within their budget. The
algorithmic task is to price each item so as to maximize the overall profit.
When each item must be priced with a positive profit margin, it is known that
there is an O(k)-approximation algorithm for the problem. We prove that, in
contrast, for the very simple case of Vertex-Pricing_3, when the seller is
allowed to price some of the items with a negative profit margin (in which case
more profit could possibly be achieved), there is no polynomial time algorithm
that gives a constant approximation to the problem, assuming the Unique Games
Conjecture.

For the learning problems, our results mostly involve showing that learning
tasks are hard for many basic function classes under the agnostic learning
model. In particular, we prove the following two results on agnostically
learning monomials and low degree polynomial threshold functions:

• Our first result is about the hardness of learning monomials. We prove that
given a set of examples, even if there exists a monomial that is consistent
with a 1 - ε fraction of the examples, it is NP-hard to find a hypothesis that
is correct on a 1/2 + ε fraction of the examples, even if we are allowed to
output a linear threshold function, for any ε > 0. Our result rules out the
possibility of using linear classifiers such as Winnow and SVM to agnostically
learn monomials.

• Our second result is on the hardness of learning polynomial threshold
functions (PTFs). We prove that, assuming the Unique Games Conjecture, given a
set of examples, even if there exists a low degree PTF that is consistent with
a 1 - ε fraction of the examples, it is NP-hard to find a low degree PTF that is
consistent with a 1/2 + ε fraction of them, for any ε > 0. In the language of
learning, we show there is no better-than-trivial proper learning algorithm
that agnostically learns low degree PTFs.
1.1  Motivation

For a vast variety of applications in computer science and engineering, the central task is
to design efficient algorithms for certain optimization problems. For example, in machine
learning, one of the major goals is to find a prediction rule with the maximum accuracy
on a particular domain of data; in computer networking, a common task is to design a
protocol that gives the minimum transmission delay.

Unfortunately, for a huge class of optimization problems, it is NP-hard to find the optimum
solution. Under the widely held belief that P ≠ NP, there does not exist a polynomial
time algorithm for any of these NP-hard optimization problems.

To cope with NP-hardness, there has been great interest in designing efficient
approximation algorithms that return a suboptimal solution provably close to the optimum.
Formally, an algorithm is called an α-approximation if it is guaranteed to output a solution
that is within a factor α of the optimum. When α = 1, the algorithm solves the problem
exactly. Ideally, we want to design an algorithm with its approximation ratio α as
close to 1 as possible, while still requiring the algorithm to run in polynomial time.
This raises the following natural question:


Question. Given an NP-hard problem, what is the best polynomial time approximation
algorithm?

Answering the above question involves proofs from two sides: first we need to exhibit
a polynomial time algorithm that has a certain approximation guarantee; second we need
to prove the impossibility of getting better polynomial time approximation algorithms.

This thesis studies the optimal approximation threshold for a variety of important
and natural NP-hard optimization problems.


1.2  Problems  Studied  in  This  Thesis 

To  give  the  reader  a  sense  of  the  optimization  problems  studied  in  the  thesis,  we  list  some 
of  them  here: 

1. (Max 2-Lin_Z) We are given a set of linear equations over the integers, each
containing at most 2 variables. Can we find an assignment to the variables that
maximizes the number of satisfied equations?

2. (MON-MA) We want to decide whether an E-mail is spam or not. A common approach
is to look at whether the E-mail contains a certain set of keywords or not.
Suppose there is a collection of keywords such that, with high accuracy, E-mails
containing all of them are spam (and vice versa). Given a set of E-mails that are labelled
with whether they are spam or not, can we find a way to classify other unlabelled
E-mails with high accuracy?

3.  (Max  Cut)  Given  a  graph,  can  we  partition  it  into  two  parts  so  as  to  maximize  the 
total  number  of  edges  between  them? 
4. (Max 3-CSP) Given a set of Boolean constraints such that each constraint
contains at most 3 variables, can we efficiently find a solution that satisfies all of
them (if there exists such a solution)?

5. (Vertex-Pricing) We are given a set of buyers, each of whom is interested in a bundle
of items. These buyers are single minded: they either buy the whole bundle, if the
total cost is within their budget, or buy nothing. The question is how to
price each item so as to maximize the overall profit.
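
To make the pricing objective in item 5 concrete, the following Python sketch computes the profit of a given price vector under single-minded buyers. It is only an illustration: the names `total_profit`, `buyers` and `prices`, and the toy numbers, are assumptions of this sketch and do not come from the thesis.

```python
from typing import Dict, List, Tuple

# A buyer is (bundle of item names, budget). All names are illustrative.
Buyer = Tuple[List[str], float]


def total_profit(prices: Dict[str, float], buyers: List[Buyer]) -> float:
    """Profit when each single-minded buyer buys the whole bundle if its
    total price is within budget, and buys nothing otherwise."""
    profit = 0.0
    for bundle, budget in buyers:
        cost = sum(prices[item] for item in bundle)
        if cost <= budget:
            profit += cost
    return profit


# Toy instance: three buyers, each interested in at most 2 items.
buyers = [(["a", "b"], 10.0), (["b", "c"], 4.0), (["c"], 3.0)]
prices = {"a": 6.0, "b": 2.0, "c": 2.0}
print(total_profit(prices, buyers))
```

The optimization problem asks for the price vector maximizing this quantity; the sketch only evaluates one candidate, which is the easy direction.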

Generally  speaking,  the  problems  studied  in  this  thesis  come  from  the  following  two 
categories:  i)  Constraint  Satisfaction  Problem  (CSP);  ii)  Computational  Learning.  Below, 
we  give  a  high  level  overview  of  these  two  classes  of  problems. 

1.2.1  Constraint  Satisfaction  Problem  (CSP) 

Briefly speaking, a Constraint Satisfaction Problem (CSP) involves a system of constraints
on a set of variables. Given a CSP, the natural algorithmic task, called “Max-CSP”, is to
find an assignment to the variables such that the total number of satisfied constraints is
as large as possible.

While the above definition of CSPs is rather abstract, many natural optimization problems
fall into the class of CSPs. One concrete example of a CSP is the linear equation
system, which consists of a set of linear equations over a set of variables. The corresponding
optimization problem is to find an assignment to the variables of the system that satisfies
as many equations as possible (if not all). In addition to linear systems, we can specialize
a Max-CSP by using other types of constraints to get many of the most canonical NP-hard
optimization problems such as Max Cut, Max 3-Sat and Max Sat.
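
As a rough illustration of how one generic Max-CSP objective specializes to a concrete problem, the sketch below represents a constraint as a pair (variables, predicate) and encodes Max Cut on a triangle as a 2-CSP with one "not equal" constraint per edge. The representation and the helper names `satisfied_fraction` and `brute_force_opt` are assumptions made for this example; the exhaustive search is exponential and only meant for tiny instances.

```python
from itertools import product

# A constraint is (tuple of variable names, predicate on their values).


def satisfied_fraction(constraints, assignment):
    """Fraction of constraints satisfied by a full assignment (a dict)."""
    ok = sum(1 for vars_, pred in constraints
             if pred(*(assignment[v] for v in vars_)))
    return ok / len(constraints)


def brute_force_opt(constraints, variables, domain):
    """Optimum value of the Max-CSP by trying every assignment."""
    best = 0.0
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        best = max(best, satisfied_fraction(constraints, assignment))
    return best


# Max Cut on a triangle: each edge (a, b) is the constraint x_a != x_b.
edges = [("u", "v"), ("v", "w"), ("u", "w")]
cut_constraints = [((a, b), lambda x, y: x != y) for a, b in edges]
triangle_opt = brute_force_opt(cut_constraints, ["u", "v", "w"], [0, 1])
print(triangle_opt)
```

Swapping in a different predicate (for example a 3-variable clause) gives other members of the family without changing the evaluation code.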

CSPs also have deep roots in theoretical computer science. The NP-hardness of
Max Cut and Max-3SAT came along at the very beginning of NP-completeness
theory [34, 88] in the seventies. Shortly after that, a seminal paper by Johnson [82],
a foundational paper of the field of approximation algorithms, designed
algorithms for many NP-hard optimization problems including Max-SAT, Max-3SAT as
well as Set Cover, Coloring and Maximum Independent Set. Since then, there has been
a flurry of work successfully designing approximation algorithms for various CSPs.
Many of the early algorithms are based on Linear Programming; in a 1994 breakthrough
in both theory and practice, Goemans and Williamson [59] gave a Semidefinite
Programming (SDP) rounding algorithm achieving a 0.878 approximation guarantee
for Max Cut; it is the first algorithm with a nontrivial approximation for Max Cut. Since
then, there has been tremendous interest in designing SDP based approximation algorithms for
various CSPs [11, 15, 29, 32, 37, 49, 110, 123, 141, 142].

Compared with the quick development on the algorithmic side, progress on proving
hardness of approximation results was relatively slow until the early nineties.
The first major breakthrough is the celebrated PCP theorem, which is equivalent to the
following statement: there exists some constant ε > 0 such that given a 3-SAT instance
that can be satisfied by some assignment, it is NP-hard to find an assignment that satisfies
a 1 - ε fraction of the constraints. This implies that it is NP-hard to have an approximation
better than 1 - ε for Max 3-Sat. Since then, many improved hardness
results have been obtained for various kinds of CSPs. In a seminal work, Håstad [76] improved the
hardness of approximation ratio of Max 3-Sat from 1 - ε to 7/8 + ε. In the same work, he gave
many other inapproximability results, including showing that Max 3-Lin_q is hard
to approximate beyond the trivial 1/q ratio.

We now have optimal (i.e., matching) approximation algorithms and
NP-hardness-of-approximation results for many key problems: Max k-Lin_q for k ≥ 3 [76],
Max 3-Sat [76, 87, 143], and a few other Max k-CSP problems with k ≥ 3 [65, 76, 138, 141].
However, many basic problems remain unresolved; for example, we do not know if
90%-approximating Max Cut is in P or is NP-hard. Similarly, given a satisfiable 3-CSP, we do
not know if satisfying 2/3 of the constraint-weight is in P or is NP-hard. To address this,
the Unique Games Conjecture, along with some variants of it called the d-to-1 Conjectures,
was proposed by Khot [97] in 2002.

One equivalent statement of the Unique Games Conjecture (UGC) [99] is about the
approximability of the following problem:

Definition 1.2.1. (Γ-Max 2-Lin_q) We are given a system of linear equations with variables
$\{x_i\}_{i=1}^{n}$, and all the equations are of the simple form $x_i - x_j = c_{ij} \pmod{q}$
with integer coefficients $0 \le c_{ij} \le q - 1$. The goal is to assign each $x_i$ some value
in $\{0, 1, \ldots, q - 1\}$ such that the maximum number of equations are satisfied.

UGC states that it is extremely hard to approximate the Γ-Max 2-Lin_q problem in
the following sense:

Conjecture 1.2.2. (UGC) For any ε > 0, there exists a large enough q such that for
Γ-Max 2-Lin_q instances, even if there is an assignment that satisfies a 1 - ε fraction of
the equations, it is NP-hard to find an assignment that satisfies more than an ε fraction
of the equations.
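
The objective in Definition 1.2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: equations are encoded as triples `(i, j, c)` standing for x_i - x_j = c (mod q), and the function names and the toy system are hypothetical. The brute-force search enumerates all q^n assignments, so it says nothing about the conjectured hardness; it only pins down what is being counted.

```python
from itertools import product


def satisfied_fraction_2lin(equations, assignment, q):
    """Fraction of equations x_i - x_j = c (mod q) that the assignment meets.

    equations: list of triples (i, j, c); assignment: sequence of values
    in {0, ..., q-1} indexed by variable number.
    """
    ok = sum(1 for i, j, c in equations
             if (assignment[i] - assignment[j]) % q == c % q)
    return ok / len(equations)


def brute_force_opt_2lin(equations, n, q):
    """Best achievable fraction, by exhaustive search over all q**n assignments."""
    return max(satisfied_fraction_2lin(equations, a, q)
               for a in product(range(q), repeat=n))


# Toy system mod 3: the first two equations force x_0 - x_2 = 2,
# which contradicts the third, so at most 2 of 3 can hold.
q = 3
equations = [(0, 1, 1), (1, 2, 1), (0, 2, 0)]
opt_2lin = brute_force_opt_2lin(equations, n=3, q=q)
print(opt_2lin)
```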

Assuming the UGC, many optimal hardness of approximation results have been proved,
such as those for Max Cut [99, 102, 120], Max-2Sat [13, 14] and many
other problems [43, 67, 103]. In a powerful work, Raghavendra [125] obtained a very
general result: for almost any CSP, the optimal approximation is achieved by certain
generic SDP algorithms.

This thesis includes work that initiated this line of research as well as work that
reflects the latest developments in the area. In particular, we study the approximability of
several important CSPs: Max Cut, Max 3-CSP and Vertex-Pricing. In addition, we
study the approximability as well as the SDP approximation for several variants of the
Γ-Max 2-Lin_q problem.

1.2.2  Computational  Learning 

In addition to the CSPs, we also study NP-hard optimization problems from Computational
Learning theory, a branch of theoretical computer science that studies how to efficiently
infer an unknown target function from examples under certain distributions. For
example, the target function can be “whether it is going to rain tomorrow?” and the input
to the target function could be the measurements of different physical conditions of today
such as temperature, humidity, and wind speed. The learning algorithm has access
to a set of labelled examples (e.g., the measurements of a certain day as well as the
weather of the day after) and the goal is to infer the target function with high accuracy.

We usually assume that the target function has a certain “simple” structure, as otherwise
we have no way of inferring the function on any unseen examples. Some examples
of classes of simple functions include: monomials (conjunctions), decision lists, majority
functions, halfspaces, low degree polynomial threshold functions (PTFs), small size decision
trees, DNF, CNF, Neural Networks, etc.

In learning theory, researchers are mainly interested in whether these simple function
classes are learnable. The learnability of a function class is defined by whether we can
use a small number of labelled examples and a small amount of computation time to find a
function which has a good agreement with the target function in that function class. Such a
model is formalized as the PAC learning model by Valiant [140]. While the original PAC learning
model assumes that some simple target function correctly labels all the data, this model
has been generalized by Haussler [78] and Kearns [90] to address the case when there is
noise in the labels and examples. Under their model (which is called the agnostic learning
model), it is only known that there is some simple function that correctly labels
a c (say c = 0.95) fraction of the examples, and the goal of the learner is to come up with a
hypothesis with accuracy close to c.

All these learning problems can be viewed as optimization tasks: we are given a
set of labelled examples and our goal is to find a hypothesis with maximum prediction
accuracy. For many important concept classes, finding the optimal hypothesis, especially
when there is noise, is NP-hard. A good learning algorithm usually returns a hypothesis
that approximates the optimal one well. In the thesis, we are particularly interested in the
learnability (approximability) of three common function classes: monomials, halfspaces
and polynomial threshold functions under the agnostic learning model.

Comparison between Learning and Constraint Satisfaction Problems. A learning
problem can also be viewed as a CSP: each example is a constraint and the goal
is to find a hypothesis, specified by a set of variables, that has the maximum agreement
with all the examples.
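
The view of a learning problem as a CSP can be sketched as follows: each labelled example acts as one constraint, and the objective is the fraction of examples on which a hypothesis agrees with the label. The sketch below uses a monomial (conjunction of literals) as the hypothesis; the encoding of literals as `(index, wanted_value)` pairs, the function names and the toy data are assumptions of this example, not definitions from the thesis.

```python
def monomial(literals):
    """Conjunction of literals over 0/1 inputs.

    literals: list of (index, wanted_value); the hypothesis outputs True
    exactly when every listed coordinate has its wanted value.
    """
    def h(x):
        return all(x[i] == v for i, v in literals)
    return h


def agreement(h, examples):
    """Fraction of labelled examples (x, y) with h(x) == y."""
    return sum(1 for x, y in examples if h(x) == y) / len(examples)


# Toy data: the last example is "noise" relative to the monomial x_0 AND x_1.
examples = [
    ((1, 1, 0), True),
    ((1, 1, 1), True),
    ((0, 1, 1), False),
    ((1, 0, 0), False),
    ((1, 1, 0), False),
]
h = monomial([(0, 1), (1, 1)])
print(agreement(h, examples))
```

Note that a single example constrains the whole hypothesis at once, which is the sense in which such constraints involve many variables.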

Although learning problems are a special kind of CSP, in this thesis we discuss the learning
problems and other CSPs separately for the following two reasons: i) the CSPs (except
for the learning problems) in this thesis all have “local” constraints, i.e., each constraint
involves a constant number of variables, while the constraints in the learning problems are
“global”, involving many variables; ii) the techniques for proving upper and lower
bounds for these two classes of problems are relatively independent.


1.3  Organization  and  Summary 

1.3.1  Organization 

While the rest of the thesis spans a variety of different problems, they are all united by the
theme of understanding the approximability or inapproximability of NP-hard optimization
problems in Learning and CSPs. The thesis is organized as follows:

In Chapter 2, we define the problems studied in the thesis and give some relevant background
knowledge. In Chapter 3, we define the mathematical tools used throughout the
thesis. The remainder of the thesis divides into two relatively independent parts.

In Part II, our work is mainly on understanding the approximability of various CSPs as
well as the SDP algorithms for them: we study the problem of Max Cut in Chapter 4, Max
3-CSP in Chapter 5, the SDP formulation of several variants of Γ-Max 2-Lin_q in Chapter 6,
a generalization of Γ-Max 2-Lin_q to the integer domain in Chapter 7, and the problem
of vertex pricing in Chapter 8.

In Part III, our work is mainly to prove that several learning tasks are inherently
hard to approximate; i.e., there is no better-than-trivial algorithm for the problem. In
Chapter 9, we study the learnability of monomials under the agnostic learning model. In
Chapter 10, we study the learnability of polynomial threshold functions (PTFs) under the
same model.

1.3.2  Summary  of  Thesis  Contributions 

We  summarize  the  main  contributions  of  the  thesis  as  follows: 

•  For  Part  II: 

■ In Chapter 4, we give a complete characterization of the approximability
of the Max Cut problem assuming the UGC. In particular, we can answer the
following question: given a Max Cut instance of optimum value c, what is the
best polynomial time approximation guarantee we can achieve? To obtain such
a result, we show that a certain RPR^2 SDP rounding algorithm [50] is the optimal
polynomial time algorithm for Max Cut. In addition, we precisely determine
the SDP gap, which is an important geometric property of the SDP, for the Max Cut
problem.

■ In Chapter 5, we study the approximability of satisfiable Max 3-CSP; i.e.,
given a 3-CSP such that there exists a perfect assignment satisfying all the
constraints, what is the best approximation guarantee s we can get? The optimal
approximation ratio for this problem also corresponds to a fundamental
open problem in the area of PCP: What is the smallest s such that
$\mathrm{NP} \subseteq \mathrm{naPCP}_{1,s}[O(\log n), 3]$?

The previous best upper bound and lower bound for s are 20/27 + ε by Khot and
Saket [104] and 5/8 by Zwick [141]. In this work we close the gap assuming
Khot’s d-to-1 Conjecture. Formally, we prove that if Khot’s d-to-1 Conjecture
holds for any finite constant integer d, then the optimal approximation for
satisfiable Max 3-CSP is indeed 5/8.

■ In Chapter 6, we present SDP gap instances for three variants of
Unique-Games: (i) 2-to-1 Label-Cover; (ii) 2-to-2 Label-Cover; (iii) α-constraint
Label-Cover. Compared with the existing Unique-Games SDP instance,
the difference is that all of our SDP gap instances have perfect SDP solutions.
For alphabet size K, the optimal solutions have value: (i) $O(1/\sqrt{\log K})$; (ii)
$O(1/\log K)$; (iii) $O(1/\sqrt{\log K})$. Prior to this work, there were no known SDP
gap instances for any of these problems with perfect SDP value and integral
optimum tending to 0.

■ In Chapter 7, we study the hardness of solving integer linear systems in which
each equation contains at most two variables. As mentioned, the UGC is
equivalent to the following statement: given a linear system with equations
of the form $x_i - x_j = c_{ij} \bmod q$, it is NP-hard to find an ε-good solution to the system
even if we know that there is an assignment that satisfies a 1 - ε fraction of the
equations. It is natural to ask whether such a linear system is still hard when
equations are evaluated over the integers. Assuming the UGC, we prove that this
hardness still holds for equations over the integers (or even the real numbers).

■ In Chapter 8, we consider the problem of pricing n items under unlimited
supply with single minded buyers, each of whom is interested in at most k of
the items. Here "single minded" means that each buyer will either buy
all of the items they are interested in, if the overall cost is within their budget,
or buy none of them. The goal is to price each item with profit margins
$p_1, p_2, \ldots, p_n$ so as to maximize the overall profit. There is an
O(k)-approximation algorithm when the price of each item must be above its margin
cost; i.e., each $p_i > 0$ [26].

We investigate the above problem when the seller is allowed to price some of
the items below their margin cost. It was shown that by pricing some of the
items below cost, the seller could possibly increase the maximum profit by
a factor of Ω(log n) [26, 56]. Items sold at low prices to stimulate other profitable
sales are usually called "loss leaders". It is unclear what kind of approximation
guarantees are achievable when some of the items can be priced
below cost. Understanding this question was posed as an open problem by Blum
and Balcan [26]. We give a strong negative result for the problem of pricing
loss leaders. We prove that, assuming the Unique Games Conjecture, there is
no constant approximation algorithm for item pricing with prices below cost
allowed, even when each customer is interested in at most 3 items.
Conceptually, our result indicates that although it is possible to make more
money by selling some items below their margin cost, it can be computationally
intractable to do so.

•  For  Part  III: 

■ In Chapter 9, we prove the following strong hardness result for learning monomials:
given a distribution of labeled examples of binary inputs such that there
exists a monomial (conjunction) consistent with a (1 - ε) fraction of the examples, it is
NP-hard to find a halfspace that is correct on a (1/2 + ε) fraction of the examples, for
arbitrary constants ε > 0. In learning theory terms, weak agnostic learning of
monomials is hard, even if one is allowed to output a hypothesis from the much
bigger concept class of halfspaces. As immediate corollaries of our result, we
show that weak learning of noisy decision lists and majorities is NP-hard. There
is a large class of learning algorithms that use halfspaces as their hypotheses,
such as SVM, Perceptron and Logistic Regression. Our result rules out
the possibility that any of these algorithms can be used to learn the function
class of monomials with noise.

■ In Chapter 10, we prove two hardness results for the problem of agnostic learning
of low degree polynomial threshold functions (PTFs): for any constants d ≥ 1,
ε > 0,

- Assuming the UGC, it is NP-hard to find a degree-d PTF that is consistent
with a (1/2 + ε) fraction of a given set of labeled examples in $\mathbb{R}^n \times \{-1, 1\}$, even
if there exists a degree-d PTF that is consistent with a 1 - ε fraction of the
examples.

- It is NP-hard to find a degree-2 PTF that is consistent with a (1/2 + ε) fraction
of a given set of labeled examples in $\mathbb{R}^n \times \{-1, 1\}$, even if there exists
a halfspace (degree-1 PTF) that is consistent with a 1 - ε fraction of the
examples.

These results immediately imply the following hardness of learning results: i)
Assuming the UGC, there is no better-than-trivial proper learning algorithm
that agnostically learns degree d PTFs under arbitrary distributions; ii) There
is no better-than-trivial learning algorithm that outputs degree 2 PTFs and
agnostically learns halfspaces (i.e., degree 1 PTFs) under arbitrary distributions.
Chapter  2
Background
In this chapter, we formally define the Constraint Satisfaction Problems and Learning
Problems studied in the rest of this thesis. We also lay out the framework under which we
analyze approximation algorithms, in particular those based on SDP. In addition,
we formally define the UGC and several of its variants, on which many of
the results are based. Last, we introduce a gadget called the Dictator Test and explain its
relationship with hardness of approximation results for learning and CSPs.


2.1  Notations 

First we define the symbols used throughout the thesis, together with their meanings.


Symbol : Meaning
$\mathbb{R}$ : Real numbers
$\mathbb{N}$ : Natural numbers
$\mathbb{Z}$ : Integers
$B_n$ : $\{x \in \mathbb{R}^n : \|x\| \le 1\}$
$S^{n-1}$ : $\{x \in \mathbb{R}^n : \|x\| = 1\}$

Vector: For a vector $x \in \mathbb{R}^n$ and $i \in [n]$, we use $x_i$ to denote its i-th coordinate and write
$x = (x_1, x_2, \ldots, x_n)$. For any $S \subseteq [n]$, we use $x_S$ to denote the collection of coordinates in the set
S.


2.2  Approximation  and  Hardness  of  Approximation 

Let G be an instance of an NP-hard maximization problem. We fix the following
notation: we denote the optimum value of the problem on G by
Opt(G); for a polynomial-time algorithm A for the problem, we use $\mathrm{Alg}_A(G)$ to denote the
value of the solution output by A on G.

The  traditional  way  to  measure  the  quality  of  an  approximation  algorithm  is  to  look  at 
the  ratio: 

Definition 2.2.1. (Approximation ratio) We call an algorithm A an α-approximation if for every
instance G of the problem,

$\frac{\mathrm{Alg}_A(G)}{\mathrm{Opt}(G)} \ge \alpha$.

Correspondingly, we can define the hardness of approximation ratio:

Definition 2.2.2. (Hardness of Approximation ratio) We call a problem α-hard to approximate
if there is no polynomial time algorithm with better than α-approximation unless
P = NP.

The notions of approximation and hardness of approximation as ratios have some unsatisfactory aspects, though. Instances with different optimum values can have very different hardness of approximation ratios. Take the problem of solving linear systems over real variables as an example. When a linear system has a solution that satisfies all the equations, we can efficiently solve the problem exactly by Gaussian elimination. However, if we only know that there is a solution that satisfies 99% of the equations, then it is NP-hard to recover a solution that satisfies even 1% of the equations [68]. Another example is the Max Cut problem. The Goemans-Williamson (GW) algorithm [59] guarantees that this ratio is always at least 0.878. However, this guarantee is not very good for graphs G with only moderately large maximum cuts. For example, if Opt(G) = 0.55, which means the optimum assignment satisfies a 0.55 fraction of the constraints, then the GW algorithm may [4] only find a solution with value 0.878 · 0.55 < 0.49, which is worse than the trivial guarantee of 1/2. On the other hand, Goemans and Williamson showed [59] that when Opt(G) = 0.95, their algorithm finds a solution with value at least 0.90, which is significantly better than 0.878 · 0.95.
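The arithmetic behind the two Max Cut examples can be checked directly. The snippet below is only an illustration: the helper name `ratio_guarantee` is ours, and 0.878 is the rounded GW constant quoted in the text.

```python
GW_RATIO = 0.878  # rounded Goemans-Williamson constant quoted in the text


def ratio_guarantee(opt, ratio=GW_RATIO):
    """Value guaranteed by a pure ratio bound when the optimum is `opt`."""
    return ratio * opt


low = ratio_guarantee(0.55)   # about 0.483
high = ratio_guarantee(0.95)  # about 0.834

print(low, low < 0.5)      # the ratio bound falls below the trivial value 1/2
print(high, high < 0.90)   # the ratio bound is weaker than the 0.90 from [59]
```

In both regimes the single number 0.878 says little about what is actually achievable, which is the motivation for the curve-based definitions that follow.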

We think it is essential to measure the quality of approximation and hardness of approximation not with a single ratio but with a curve. Let us first assume that we have a maximization problem $\mathcal{P}$ with optimum value in the range [0, 1].

Definition 2.2.3. We say that an algorithm $A$ achieves approximation curve $\mathrm{Apx}_A : [0,1] \to [0,1]$ for problem $\mathcal{P}$ if

$$\mathrm{Alg}_A(G) \ge \mathrm{Apx}_A(\mathrm{Opt}(G)) \quad \text{for all instances } G.$$

The following definition characterizes the approximation guarantee at a particular optimum value $c$.

Definition 2.2.4. Assume $\mathcal{P}$ is an optimization problem and $A$ is an algorithm for it. If $\mathrm{Alg}_A(G) \ge s$ for every instance G with $\mathrm{Opt}(G) \ge c$, then we say that algorithm $A$ $(c,s)$-approximates the problem $\mathcal{P}$.

Correspondingly, we can define the hardness of $(c,s)$-approximation; usually we prove such a claim by showing the NP-hardness of the following decision problem:

Definition 2.2.5. For an optimization problem $\mathcal{P}$, we use $\mathcal{P}(c,s)$ to denote the following problem: given an instance G of $\mathcal{P}$, distinguish between the following two cases:

1. $\mathrm{Opt}(G) \ge c$;

2. $\mathrm{Opt}(G) < s$.

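Essentially, if $\mathcal{P}(c,s)$ is NP-hard, then it is NP-hard to $(c,s)$-approximate $\mathcal{P}$. This is because if there is a polynomial-time algorithm that $(c,s)$-approximates $\mathcal{P}$, we can run the algorithm on instances of $\mathcal{P}(c,s)$ and output "Opt(G) ≥ c" if the algorithm outputs a value of at least $s$, and "Opt(G) < s" otherwise. The second answer is correct because the value of any solution is at most Opt(G).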
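The reduction from the decision problem to approximation can be sketched in a few lines. This is a minimal illustration under stated assumptions: `approx_value` is a hypothetical stand-in for running a $(c,s)$-approximation algorithm and returning the value of the solution it found, and the toy "instances" are just numbers playing the role of Opt.

```python
from typing import Callable, TypeVar

Instance = TypeVar("Instance")


def decide_gap(approx_value: Callable[[Instance], float],
               instance: Instance, s: float) -> str:
    """Decide the promise problem P(c, s) with a (c, s)-approximation.

    `approx_value` (hypothetical) returns the value of the solution found,
    which can never exceed the optimum of the instance.
    """
    value = approx_value(instance)
    return "Opt >= c" if value >= s else "Opt < s"


# Toy stand-in: the instance is its own optimum and the fake algorithm
# loses a factor 0.9, so it (0.8, 0.72)-approximates this toy problem.
toy_algorithm = lambda opt: 0.9 * opt

print(decide_gap(toy_algorithm, 0.85, s=0.72))  # Opt >= c
print(decide_gap(toy_algorithm, 0.50, s=0.72))  # Opt < s
```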


2.3  CSPs  and  SDP 

2.3.1  CSPs 

A Constraint Satisfaction Problem (CSP) involves a system of constraints over variables $\{v_i\}_{i=1}^{n}$. A "k-CSP" is a system of constraints in which each constraint involves at most k of the variables. We also assume each constraint has a nonnegative weight, with the sum of all weights being 1. Given a k-CSP, the natural algorithmic task, called "Max k-CSP", is to find an assignment to the variables such that the total weight of satisfied constraints is as large as possible. We write "Opt" to denote the weight satisfied by the best possible assignment. We also say that a CSP is "satisfiable" if Opt = 1. Figure 2.1 is an example of a 3-CSP.

weight    constraint
1/4       $v_1 \wedge \neg v_3 \wedge v_4$
1/4       IF $v_3$ THEN $v_4$ ELSE $\neg v_5$
1/2       $v_2 \ne v_5$

Figure 2.1: 3-CSP
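To make the definition of Opt concrete, the following sketch evaluates a small weighted 3-CSP by exhaustive search. The three constraints follow our reading of Figure 2.1 (the first is taken to be a conjunction, which is an assumption), and the helper names are ours.

```python
from itertools import product

# (weight, predicate) pairs; each predicate reads a dict {index: bool}.
# The constraints follow our reading of Figure 2.1 (an assumption).
constraints = [
    (0.25, lambda v: v[1] and not v[3] and v[4]),
    (0.25, lambda v: v[4] if v[3] else not v[5]),
    (0.50, lambda v: v[2] != v[5]),
]


def value(assignment):
    """Total weight of the constraints satisfied by `assignment`."""
    return sum(w for w, pred in constraints if pred(assignment))


def brute_force_opt(num_vars=5):
    """Try all 2^n assignments; only sensible for toy instances."""
    best = max(
        (dict(zip(range(1, num_vars + 1), bits))
         for bits in product([False, True], repeat=num_vars)),
        key=value,
    )
    return value(best), best


opt, best = brute_force_opt()
print(opt)  # 1.0, so this instance is satisfiable
```

Exhaustive search takes time exponential in the number of variables, which is exactly why approximation algorithms are of interest for Max k-CSP.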

Each constraint in a k-CSP is of a certain "type"; more precisely, it is a certain predicate with arity at most k over the variables. If we specialize Max-kCSP by restricting the type of constraints allowed, we get some of the most canonical NP optimization problems. Consider the special case when a CSP is over Boolean variables $v_1, \dots, v_n$, and let $l_i$ denote a literal, which can represent either $v_i$ or $\neg v_i$. Some of the important classes of Boolean CSPs are listed here:

• Max-2Sat: with predicate $l_i \vee l_j$;

• Max-3Sat: with predicate $l_i \vee l_j \vee l_k$;

• Max-2Lin: with predicates $v_i \oplus v_j$, $\neg(v_i \oplus v_j)$;

• Max Cut: with predicate $v_i \ne v_j$;

• Max-3CSP: with all the possible 3-bit predicates $P(v_i, v_j, v_k) : \{0,1\}^3 \to \{0,1\}$;

• Max-kCSP: with all the possible k-bit predicates $P(v_1, v_2, \dots, v_k) : \{0,1\}^k \to \{0,1\}$.

We  also  study  some  less  familiar  3-CSPs  in  the  thesis. 

• Max NTW: with predicate $\mathrm{NTW}(l_1, l_2, l_3)$, where NTW is the 3-ary predicate that evaluates to true if and only if 0, 1 or 3 of its inputs are true; i.e., "Not Two True";

• Max NAE: with predicate $\mathrm{NAE}(l_1, l_2, l_3) = \neg(l_1 = l_2 = l_3)$.
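The two less familiar predicates are easy to tabulate. The sketch below (function names are ours) implements them directly from the definitions above and counts their satisfying inputs.

```python
from itertools import product


def ntw(a: bool, b: bool, c: bool) -> bool:
    """Not Two True: accepts when 0, 1 or 3 of the inputs are true."""
    return (a + b + c) != 2


def nae(a: bool, b: bool, c: bool) -> bool:
    """Not All Equal: rejects only (F, F, F) and (T, T, T)."""
    return not (a == b == c)


inputs = list(product([False, True], repeat=3))
print(sum(ntw(*x) for x in inputs))  # 5 of the 8 inputs satisfy NTW
print(sum(nae(*x) for x in inputs))  # 6 of the 8 inputs satisfy NAE
```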

Further, we also study CSPs over larger domains (other than Boolean values) such as $[q]$, or even $\mathbb{Z}$ and $\mathbb{R}$. The following are definitions of such CSPs that will be discussed in the rest of the thesis.

• Max 2-Lin$_q$: $v_i \in [q]$, with predicates $a v_i + b v_j = c \bmod q$ for $a, b, c \in [q]$;

• Max 2-Lin$_{\mathbb{Z}}$: $v_i \in \mathbb{Z}$, with predicates $a v_i + b v_j = c$ for $a, b, c \in \mathbb{Z}$;

• Max 2-Lin$_{\mathbb{R}}$: $v_i \in \mathbb{R}$, with predicates $a v_i + b v_j = c$ for $a, b, c \in \mathbb{R}$;

• Max 3-Lin$_q$: $v_i \in [q]$, with predicates $a v_i + b v_j + c v_k = d \bmod q$ for $a, b, c, d \in [q]$;

• Max 3-Lin$_{\mathbb{Z}}$: $v_i \in \mathbb{Z}$, with predicates $a v_i + b v_j + c v_k = d$ for $a, b, c, d \in \mathbb{Z}$;

• Max 3-Lin$_{\mathbb{R}}$: $v_i \in \mathbb{R}$, with predicates $a v_i + b v_j + c v_k = d$ for $a, b, c, d \in \mathbb{R}$;

• Γ-Max 2-Lin$_{\mathbb{Z}}$, Γ-Max 2-Lin$_{\mathbb{R}}$, Γ-Max 2-Lin$_q$: Max 2-Lin$_{\mathbb{Z}}$, Max 2-Lin$_{\mathbb{R}}$, Max 2-Lin$_q$ with the additional constraint that each equation has the special form $v_i - v_j = a$, evaluated in the corresponding domain.

Each constraint in a k-CSP can be viewed as a function $f$ that maps the values of its (at most k) variables to $\{0,1\}$. An assignment satisfies the constraint $f$ if the value of $f$ is 1. We can further relax the definition of CSPs by allowing constraints to be more general payoff functions that take real values (rather than values in $\{0,1\}$). The goal of the optimization task is then to find an assignment that maximizes the weighted sum of the payoffs over all of the constraints. In this thesis, we will also study a CSP called Vertex-Pricing with the following generalized payoff function:

• Vertex-Pricing$_k$: the variables are $v_1, v_2, \dots, v_k \in \mathbb{R}$ and the constraint is of the form

$$f_b(v_1, \dots, v_k) = \mathbf{1}\Big(\sum_i v_i \le b\Big) \cdot \Big(\sum_i v_i\Big)$$

for some positive constant $b \in \mathbb{R}^+$.

We will explain the problem in more detail in Chapter 8.


2.3.2  SDP  Gap 

Most of the best approximation guarantees currently known for CSPs are achieved by algorithms using Semidefinite Programming (SDP). Generally speaking, an SDP-based algorithm involves two parts: relaxing the original problem into an SDP, and rounding the solution of the SDP to an integer solution.


For the purpose of exposition, let us use the Max Cut problem as an example. Suppose we have a Max Cut instance G over Boolean variables $x_1, \dots, x_n \in \{-1, 1\}$ with a set of constraints $x_i \ne x_j$, each with positive weight $w_{ij}$.

Essentially,  the  optimum  value  of  G  is  the  maximum  of  the  following  integer  program¬ 
ming  problem: 


$$\max_{x_i \in \{-1,1\}} \sum_{i<j} w_{ij} \, \frac{1 - x_i x_j}{2}$$


Solving the above optimization problem is NP-hard as it is equivalent to the Max Cut problem. The SDP relaxation of Max Cut replaces each $x_i$ with a vector variable $v_i$ (i.e., $v_i \in \mathbb{R}^n$, $|v_i| = 1$) and replaces the product of two integers by the inner product of two vectors. The following relaxed optimization problem is the SDP relaxation of Max Cut, and we call its optimum Sdp(G):

$$\max_{v_i \in S^{n-1}} \sum_{i<j} w_{ij} \, \frac{1 - v_i \cdot v_j}{2}$$


Clearly Sdp(G) ≥ Opt(G), as we can always set each $v_i$ to be a one-dimensional unit vector (i.e., 1 or −1) to achieve the integral optimum. The utility of this relaxation is that we can actually find an essentially optimal solution in polynomial time [59]. After solving the relaxed optimization problem, we can derive values for the $x_i$ from the vectors $v_i$. For example, after getting a set of vectors $v_1, v_2, \dots, v_n$, the famous Goemans-Williamson algorithm uses a random hyperplane to cut the vectors into two parts, and this naturally induces an assignment of the $x_i$. Another interpretation of the GW algorithm is that we first pick a random vector $r$ and then set $x_i = \mathrm{sgn}(r \cdot v_i)$. A simple analysis of this rounding scheme shows that $\mathrm{Alg}_{GW}(G) \ge 0.878 \cdot \mathrm{Sdp}(G) \ge 0.878 \cdot \mathrm{Opt}(G)$, and thus the Goemans-Williamson algorithm achieves a 0.878-approximation.
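The rounding step can be sketched as follows. This is a minimal illustration, not a full implementation: the vectors are written down by hand for a triangle (three unit vectors 120 degrees apart) instead of being produced by an SDP solver, and all function names are ours.

```python
import math
import random


def hyperplane_round(vectors, rng):
    """Set x_i = sgn(r . v_i) for a random Gaussian direction r."""
    dim = len(vectors[0])
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [1 if sum(ri * vi for ri, vi in zip(r, v)) >= 0 else -1
            for v in vectors]


def cut_value(x, weights):
    """Integer objective: sum of w_ij * (1 - x_i x_j) / 2."""
    return sum(w * (1 - x[i] * x[j]) / 2 for (i, j), w in weights.items())


def sdp_value(vectors, weights):
    """Relaxed objective: sum of w_ij * (1 - v_i . v_j) / 2."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return sum(w * (1 - dot(vectors[i], vectors[j])) / 2
               for (i, j), w in weights.items())


# Triangle with edge weights 1/3; hand-picked unit vectors 120 degrees apart.
weights = {(0, 1): 1 / 3, (0, 2): 1 / 3, (1, 2): 1 / 3}
vectors = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
           for k in range(3)]

rng = random.Random(0)
x = hyperplane_round(vectors, rng)
print(sdp_value(vectors, weights))  # about 0.75
print(cut_value(x, weights))        # about 0.667: two of three edges are cut
```

On this instance every hyperplane through the origin separates one vector from the other two, so the rounded cut always has value 2/3, compared with the relaxed value 3/4.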

For other CSPs, the SDP algorithms all follow a similar framework: i) formulate the original problem as an integer program; ii) relax it and solve the corresponding SDP; iii) round the SDP solution to an integer solution.

In  evaluating  SDP  algorithms,  we  would  like  to  compare  the  algorithm  output  to  the 
optimum  values.  However  doing  this  directly  is  difficult  —  roughly  because  Max-CSPs 
are  usually  hard,  and  therefore  we  do  not  analytically  have  access  to  the  optimum.  The 
approximation  guarantees  of  SDP-based  algorithms  are  actually  based  on  comparing  the 
value  of  the  algorithm  output  to  the  SDP  value : 

Definition 2.3.1. Given an SDP algorithm $A$, we use Sdp(G) to denote the corresponding SDP value. We say that the SDP algorithm $A$ achieves SDP-approximation curve $\mathrm{SdpApx}_A : [0,1] \to [0,1]$ if

$$\mathrm{Alg}_A(G) \ge \mathrm{SdpApx}_A(\mathrm{Sdp}(G)) \quad \text{for all } G.$$

There is an obvious barrier to how good SDP-approximation guarantees can be: if there exists an instance G with Sdp(G) ≥ c and Opt(G) ≤ s, then of course no algorithm can have an SDP-approximation curve SdpApx with SdpApx(c) > s. The SDP gap is defined as follows:

Definition 2.3.2. For $0 \le s \le c \le 1$, we call the pair $(c,s)$ an SDP gap if there exists an instance G with $\mathrm{Sdp}(G) \ge c$ and $\mathrm{Opt}(G) \le s$. We define the SDP gap curve by

$$\mathrm{GapSDP}(c) = \inf\{s : (c,s) \text{ is an SDP gap}\}.$$

In  addition,  the  SDP  gap  gives  a  measure  of  how  close  the  SDP  is  to  the  original  integer 
programming  problem. 
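As a small worked example of an SDP gap (our own illustration, not a result of the thesis), consider Max Cut on a triangle $K_3$ with edge weights 1/3, using the Max Cut relaxation described earlier in this section. The calculation below shows that (3/4, 2/3) is an SDP gap.

```latex
% Triangle K_3, each edge of weight 1/3.
% Integral optimum: any cut separates at most 2 of the 3 edges.
\mathrm{Opt}(K_3) = \tfrac{2}{3}.
% SDP value: for unit vectors, |v_1 + v_2 + v_3|^2 \ge 0 gives
% \sum_{i<j} v_i \cdot v_j \ge -\tfrac{3}{2}, hence
\mathrm{Sdp}(K_3)
  = \max \sum_{i<j} \tfrac{1}{3} \cdot \frac{1 - v_i \cdot v_j}{2}
  \le \tfrac{1}{3} \cdot \frac{3 + \tfrac{3}{2}}{2}
  = \tfrac{3}{4},
% with equality for three vectors at pairwise angle 120 degrees.
% Therefore (c, s) = (3/4, 2/3) is an SDP gap, and
\mathrm{GapSDP}(\tfrac{3}{4}) \le \tfrac{2}{3}.
```

The bound on GapSDP is only an upper bound: other instances with SDP value at least 3/4 may have a smaller optimum.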


2.4  Learning  Theory 


2.4.1  Concept  Classes 

Computational Learning Theory establishes a theoretical framework for how we can infer an unknown target function from examples drawn from certain distributions. We usually assume that the target function comes from some simple concept class. Let us define a concept class as follows (assuming we only consider binary examples and labels):

Definition 2.4.1. (Concept Class) A concept class is a class of functions $f : \{0,1\}^n \to \{-1,1\}$.

Here  is  a  list  of  concept  classes  studied  in  the  thesis. 

Definition 2.4.2. (monomials) Suppose the input to the function is $x \in \{0,1\}^n$, and suppose $l_i$ is a literal that can represent either $x_i$ or $\neg x_i$. A monomial is a conjunction of a subset of literals, which can be represented as

$$\bigwedge_{i \in S} l_i$$

for some $S \subseteq [n]$.

Definition 2.4.3. (decision lists) A decision list $f$ over the Boolean variables $x \in \{0,1\}^n$ is represented by a list of pairs $(l_1, b_1), (l_2, b_2), \dots, (l_k, b_k)$ and a default value $b_{k+1}$, where each $l_i$ is a literal (a variable or its negation) and each $b_i$ is either −1 or 1. Given any $x \in \{0,1\}^n$, the value of $f(x)$ is $b_i$ if $i$ is the smallest index such that $l_i$ is made true by $x$; if no $l_i$ is true then $f(x) = b_{k+1}$.
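The evaluation rule of a decision list translates directly into code. In the sketch below, the encoding of a literal as a pair (variable index, wanted value) and the function name are our own choices, and indices are 0-based.

```python
from typing import List, Sequence, Tuple

# A literal is (variable index, wanted value):
# (2, 1) means x_2 and (2, 0) means NOT x_2.
Literal = Tuple[int, int]


def eval_decision_list(rules: List[Tuple[Literal, int]], default: int,
                       x: Sequence[int]) -> int:
    """Return b_i for the first literal l_i made true by x, else the default."""
    for (index, wanted), label in rules:
        if x[index] == wanted:
            return label
    return default


# "if x_0 then 1, else if NOT x_2 then -1, else 1"
rules = [((0, 1), 1), ((2, 0), -1)]
print(eval_decision_list(rules, 1, (1, 0, 0)))  # 1  (first rule fires)
print(eval_decision_list(rules, 1, (0, 1, 0)))  # -1 (second rule fires)
print(eval_decision_list(rules, 1, (0, 1, 1)))  # 1  (default value)
```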

Definition 2.4.4. (halfspaces) Suppose the input is $x \in \{0,1\}^n$ (or $x \in \mathbb{R}^n$). A halfspace function $f(x) : \{0,1\}^n \to \{-1,1\}$ is the sign of the weighted sum of the $x_i$ minus a threshold:

$$\mathrm{sgn}\Big(\sum_{i=1}^{n} w_i x_i - \theta\Big).\ ^1$$

Here $w_1, \dots, w_n, \theta \in \mathbb{R}$.

Definition 2.4.5. (degree d PTFs) For a positive integer $d$, we call a function $f(x) : \{0,1\}^n \to \mathbb{R}$ (or $\mathbb{R}^n \to \mathbb{R}$) a degree $d$ polynomial function if it has a polynomial expansion of the form

$$\sum_{\text{multiset } S \subseteq [n],\ |S| \le d} c_S \prod_{i \in S} x_i.$$

Here each $c_S \in \mathbb{R}$ is a coefficient of the polynomial. A degree $d$ polynomial threshold function is of the form $\mathrm{sgn}(f(x))$, where $f(x)$ is a degree $d$ polynomial function.

A  relationship  among  all  these  concept  classes  is  that: 

monomials $\subseteq$ decision lists $\subseteq$ halfspaces $\subseteq$ degree $d$ PTFs.
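The first and last ends of this chain can be connected with a quick check: the sketch below exhaustively verifies that one monomial is computed by a halfspace. It assumes the convention that sgn(t) is 1 for t ≥ 0 and −1 otherwise; the function names and the particular choice of weights and threshold are ours.

```python
from itertools import product


def sgn(t: float) -> int:
    """Sign convention assumed here: 1 for t >= 0, -1 otherwise."""
    return 1 if t >= 0 else -1


def monomial(pos, neg, x):
    """Conjunction of x_i (i in pos) and NOT x_j (j in neg), valued in {-1, 1}."""
    ok = all(x[i] == 1 for i in pos) and all(x[j] == 0 for j in neg)
    return 1 if ok else -1


def as_halfspace(pos, neg, n):
    """Weights and threshold of a halfspace computing the same function."""
    w = [0.0] * n
    for i in pos:
        w[i] = 1.0
    for j in neg:
        w[j] = -1.0
    theta = len(pos) - 0.5  # reached only when every literal is true
    return w, theta


n, pos, neg = 4, [0, 2], [3]  # x_0 AND x_2 AND NOT x_3
w, theta = as_halfspace(pos, neg, n)
agree = all(
    monomial(pos, neg, x) == sgn(sum(wi * xi for wi, xi in zip(w, x)) - theta)
    for x in product([0, 1], repeat=n)
)
print(agree)  # True
```

The weighted sum equals the number of positive literals only when every literal is satisfied, so the threshold at that number minus 1/2 separates exactly the accepting inputs.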

2.4.2  Learning  Models 

In learning theory, researchers study whether these common concept classes are learnable. The learnability of a concept class is defined under the PAC learning model of Valiant [140].

Definition 2.4.6. (PAC Learning) We say an algorithm $\mathcal{A}$ efficiently learns a Boolean function class $\mathcal{C}$ if the following is true for any $\delta, \epsilon > 0$, any distribution $D$ on $\{0,1\}^n$ and any $f \in \mathcal{C}$: given oracle access to example-label pairs $(x, f(x))$ with $x$ sampled from $D$, $\mathcal{A}$ outputs a hypothesis $h$ from some concept class $\mathcal{H}$ such that, with probability at least $1 - \delta$, $\Pr(h(x) = f(x)) \ge 1 - \epsilon$, with running time $\mathrm{poly}(1/\epsilon, 1/\delta, n)$. We call the learning algorithm proper if $\mathcal{C} = \mathcal{H}$.

While the original PAC learning model assumes that some function $f \in \mathcal{C}$ perfectly labels all the data, this model has been generalized by Haussler [79] and Kearns [90] to address noise. In addition, the new model has been extended to functions over real-valued inputs, $f : \mathbb{R}^n \to \{-1,1\}$. Under their model (called the agnostic learning model), it is only known that some function in a particular concept class $\mathcal{C}$ correctly labels a $c$ fraction of the examples; the goal of learning is to come up with a hypothesis whose accuracy is close to $c$.

Definition 2.4.7. (Agnostic Learning) We say an algorithm $\mathcal{A}$ agnostically learns a concept class $\mathcal{C}$ if the following is true for any $\delta, \epsilon > 0$ and any distribution $D$ on $\{0,1\}^n$ (or even $\mathbb{R}^n$): suppose $\mathcal{A}$ has oracle access to example-label pairs $(x, l_x)$ with $x$ sampled from $D$, and suppose the best hypothesis in $\mathcal{C}$ has accuracy at least $c$, i.e., $\max_{f \in \mathcal{C}} \Pr(f(x) = l_x) \ge c$; then $\mathcal{A}$ outputs some hypothesis $h \in \mathcal{H}$ such that, with probability at least $1 - \delta$, $\Pr(h(x) = l_x) \ge c - \epsilon$, with running time $\mathrm{poly}(1/\delta, 1/\epsilon, n)$.

¹ In this thesis, we use the convention that sgn(x) is 1 for $x \ge 0$ and −1 for $x < 0$.

The agnostic model still defines learnability by whether an algorithm can find a hypothesis with almost optimal accuracy; in practice, coming up with a hypothesis with any non-trivial (and not necessarily optimal) performance would still be useful. It is quite natural to relax the agnostic learning model to address this.

Definition 2.4.8. ((c,s) Agnostic Learning) For $0 < s < c < 1$, we say an algorithm $\mathcal{A}$ agnostically $(c,s)$ learns concept class $\mathcal{C}$ if the following is true for any $\delta, \epsilon > 0$ and any distribution $D$ on $\{0,1\}^n$ (or even $\mathbb{R}^n$): suppose $\mathcal{A}$ has oracle access to example-label pairs $(x, l_x)$ with $x$ sampled from $D$, and suppose the best hypothesis in $\mathcal{C}$ has accuracy at least $c$, i.e., $\max_{f \in \mathcal{C}} \Pr(f(x) = l_x) \ge c$; then $\mathcal{A}$ outputs some hypothesis $h \in \mathcal{H}$ such that, with probability at least $1 - \delta$, $\Pr(h(x) = l_x) \ge s - \epsilon$, with running time $\mathrm{poly}(1/\delta, 1/\epsilon, n)$.

Uniform convergence results in Haussler's work [78] (see also [90]) imply that for most common simple concept classes², learnability of $\mathcal{C}$ by outputting a hypothesis in $\mathcal{H}$ in the above agnostic model is equivalent to the approximability of the problem of finding a hypothesis in $\mathcal{H}$ that has the same agreement rate as the best hypothesis in $\mathcal{C}$ on the given set of examples.

We use $\mathcal{C}$-$\mathcal{H}$-MA to denote the optimization problem of finding a function in $\mathcal{H}$ that approximates the best function in $\mathcal{C}$ on a set of examples. If $\mathcal{C} = \mathcal{H}$, we just write it as $\mathcal{C}$-MA. We also define the following decision problem to characterize its approximability.

Definition 2.4.9. For $0 < s < c < 1$ and a given set of examples, we want to distinguish the following two cases:

1. There is some hypothesis $f \in \mathcal{C}$ that agrees with a $c$ fraction of the examples.

2. No hypothesis in $\mathcal{H}$ agrees with more than an $s$ fraction of the examples.

We call the above decision problem $\mathcal{C}$-$\mathcal{H}$-MA $(c,s)$.
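For intuition, the maximum agreement objective can be computed by exhaustive search on tiny inputs. The sketch below does this for monomials as both the target and the hypothesis class; the encoding of a monomial and the sample are our own, hypothetical choices.

```python
from fractions import Fraction
from itertools import product


def monomial_value(spec, x):
    """spec[i] is 1 (literal x_i), 0 (literal NOT x_i) or None (x_i unused)."""
    ok = all(want is None or x[i] == want for i, want in enumerate(spec))
    return 1 if ok else -1


def agreement(spec, examples):
    """Fraction of labelled examples (x, label) the monomial gets right."""
    hits = sum(monomial_value(spec, x) == label for x, label in examples)
    return Fraction(hits, len(examples))


def best_monomial(examples, n):
    """Exhaustive search over all 3^n monomials; toy sizes only."""
    return max(product([None, 0, 1], repeat=n),
               key=lambda spec: agreement(spec, examples))


# x_0 AND x_1 labels four of the points correctly; the repeated point
# (1, 1, 0) carries two contradictory labels, so no function exceeds 4/5.
examples = [((1, 1, 0), 1), ((1, 1, 0), -1), ((1, 0, 0), -1),
            ((0, 1, 1), -1), ((1, 1, 1), 1)]
spec = best_monomial(examples, 3)
print(agreement(spec, examples))  # 4/5
```

The search space grows as 3^n, so this is only a way to see the objective on small samples; the hardness results discussed in this chapter concern the general problem.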

Therefore, by the uniform convergence results, the NP-hardness of the above problem suggests the NP-hardness of $(c,s)$ agnostically learning the concept class $\mathcal{C}$ by hypotheses in $\mathcal{H}$. When $\mathcal{C} = \mathcal{H}$, the hardness of $\mathcal{C}$-MA implies the hardness of properly learning the concept class $\mathcal{C}$.

In the thesis, we investigate the above problems for some natural choices of $\mathcal{C}$ and $\mathcal{H}$. We are mostly interested in the case $c = 1 - o(1)$; i.e., we want to understand the learnability of a concept class $\mathcal{C}$ knowing that there is indeed some hypothesis in it that correctly labels almost all the examples.

For notational convenience, we use the following short names for the concept classes we have defined:

• MON: monomials;

• HS: halfspaces;

• DL: decision lists;

• PTF$_d$: degree $d$ PTFs.

We study the approximability of MON-HS-MA in Chapter 9 and PTF$_d$-MA in Chapter 10.

² The requirement for the uniform convergence results to hold is that a concept class should have polynomial VC dimension, a requirement that all the concept classes we study in the thesis satisfy.

2.4.3  Related  Work 

A number of hardness results for proper agnostic learning of monomials, decision lists and halfspaces have appeared in the literature. For monomials, MON-MA was shown to be NP-hard by Kearns and Li [91]. The hardness of approximating the problem within some constant factor (i.e., APX-hardness) was first shown by Ben-David et al. [20]. The factor was improved to 58/59 by Bshouty and Burroughs [16]. Finally, Feldman showed a tight inapproximability result [51] (see also [52]), namely that MON-MA $(1-\epsilon, 1/2+\epsilon)$ is NP-hard. Recently, Khot and Saket [105] proved a similar hardness result even when a $t$-CNF is allowed as the output hypothesis for an arbitrary constant $t$ (a $t$-CNF is the conjunction of several clauses, each of which has at most $t$ literals; a monomial is thus a 1-CNF).
The Maximum Agreement problem for halfspaces (HS-MA) was shown to be NP-hard to approximate by Amaldi and Kann [5], Ben-David et al. [20], and Bshouty and Burroughs [16] for approximation factors 261/262, 415/418, and 84/85, respectively. An optimal inapproximability
result was established independently by Guruswami and Raghavendra [68] and Feldman et al. [52], showing NP-hardness of HS-MA $(1-\epsilon, 1/2+\epsilon)$ for any $\epsilon > 0$. The reduction in [52] produced examples with real-valued coordinates, whereas the proof in [68] also worked for examples drawn from the Boolean hypercube. For the concept class of decision lists, APX-hardness of its Maximum Agreement problem (DL-MA) was shown by Bshouty and Burroughs [16]. As for the concept class of low-degree PTFs, the hardness of learning was not well understood before our work.

A number of hardness of approximation results are also known for the symmetric problem of minimizing disagreement for each of the above concept classes [7, 16, 51, 52, 80, 90]. Another well-known piece of evidence for the hardness of agnostic learning of monomials is that even non-proper agnostic learning of monomials would give an algorithm for learning DNF, a major open problem in learning theory [109]. Further, Kalai et al. proved that even agnostic learning of halfspaces with respect to the uniform distribution implies learning of parities with random classification noise, another long-standing open problem in learning theory and coding [84].

On the algorithmic side, monomials, decision lists, halfspaces and low-degree PTFs are all known to be PAC-learnable. Monomials, decision lists and halfspaces are even known to be efficiently learnable in the presence of more benign random classification noise [6, 22, 33, 89, 92]. Simple online algorithms like Perceptron and Winnow learn halfspaces when the examples can be separated with a significant margin (as is the case if the examples are consistent with a monomial) and are known to be robust to a very mild amount of adversarial noise [12, 57, 58]. Kalai et al. gave the first non-trivial algorithm for agnostic learning of monomials, running in time $2^{\tilde{O}(\sqrt{n})}$ [84]. They also gave a breakthrough result for agnostic learning of halfspaces with respect to the uniform distribution on the hypercube up to any constant accuracy (and analogous results for a number of other settings). Their algorithms output linear thresholds of parities as hypotheses. Very recent work [42] has in fact given efficient agnostic learning algorithms for low-degree PTFs under specific distributions on examples, such as Gaussian distributions or the uniform distribution over the Boolean cube.


2.5  Label-Cover  and  Khot’s  Conjectures 

In this section, we formally state the UGC, which leads to numerous hardness of approximation results. To begin with, let us first define the Label-Cover problem, of which Unique-Games is a special case.

Definition 2.5.1. A Label-Cover instance $\mathcal{L}$ is defined by a tuple $(U, V, E, P, R_1, R_2, \Pi)$. Here $U$ and $V$ are the two vertex sets of a bipartite graph and $E$ is the set of edges between $U$ and $V$. $P$ is an explicitly given probability distribution on $E$. $R_1$ and $R_2$ are integers with $1 \le R_1 \le R_2$. $\Pi$ is a collection of "projections", one for each edge: $\Pi = \{\pi_e : [R_2] \to [R_1] \mid e \in E\}$. A labeling $L$ is a mapping $L : U \to [R_1]$, $V \to [R_2]$. We say that an edge $e = (u,v)$ is "satisfied" by labeling $L$ if $\pi_e(L(v)) = L(u)$. We define:

$$\mathrm{Opt}(\mathcal{L}) = \max_{\text{all labelings } L} \ \Pr_{e=(u,v) \sim P}\big[\pi_e(L(v)) = L(u)\big].$$
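The definition of Opt for Label-Cover can be made concrete with an exhaustive search on a tiny instance. The sketch below is illustrative only: the data layout (a dict from edges to a probability and a projection tuple), the 0-based labels, and the example instance are all our own assumptions.

```python
from itertools import product


def label_cover_opt(U, V, edges, R1, R2):
    """Exhaustive Opt of a tiny Label-Cover instance.

    `edges` maps (u, v) to (probability, projection), where the projection
    is a tuple pi with pi[b] in range(R1) for every label b in range(R2).
    """
    best = 0.0
    for lab_u in product(range(R1), repeat=len(U)):
        for lab_v in product(range(R2), repeat=len(V)):
            L = {**dict(zip(U, lab_u)), **dict(zip(V, lab_v))}
            sat = sum(p for (u, v), (p, pi) in edges.items()
                      if pi[L[v]] == L[u])
            best = max(best, sat)
    return best


# Hypothetical instance with R1 = 2 and R2 = 4; both projections are 2-to-1.
half = (0, 0, 1, 1)    # labels 0, 1 -> 0 and labels 2, 3 -> 1
parity = (0, 1, 0, 1)  # label b -> b mod 2
edges = {("u1", "v1"): (0.5, half), ("u2", "v1"): (0.5, parity)}
print(label_cover_opt(["u1", "u2"], ["v1"], edges, R1=2, R2=4))  # 1.0
```

The search enumerates $R_1^{|U|} R_2^{|V|}$ labelings, so it is only meant to illustrate the objective.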

The fundamental inapproximability theorem of Raz [128] is the following statement on the hardness of approximating the Label-Cover problem:

Theorem 2.5.2. There exists some positive constant $\eta$ such that for every constant $\epsilon > 0$ and every $k$ with $1/k^{\eta} < \epsilon$, Label-Cover $(1, \epsilon)$ is NP-hard on instances with alphabet size $k$ (or above).

In [97], Khot conjectured that several restricted forms of the Label-Cover problem are also NP-hard.

Definition 2.5.3. (d-to-1 Label-Cover) A projection $\pi : [R_2] \to [R_1]$ is said to be "d-to-1" if for each element $i \in [R_1]$ we have $1 \le |\pi^{-1}(i)| \le d$. The d-to-1 Label-Cover problem is the special case of Label-Cover in which each projection in $\Pi$ is d-to-1.

A Unique Games instance is the special case when $d = 1$; it is sometimes also referred to as Unique Label-Cover. The UGC states that it is NP-hard to distinguish nearly satisfiable instances from instances with tiny optimum value.

Conjecture 2.5.4. (UGC) For every constant $\epsilon > 0$ there is some constant $k(\epsilon)$ such that Unique-Games $(1-\epsilon, \epsilon)$ is NP-hard on instances with label size greater than $k(\epsilon)$.

It is easy to see that Γ-Max 2-Lin$_q$ is a special case of Unique Games with alphabet size $q$. By the work of [99], it is also known that the UGC is equivalent to the following statement:

Conjecture 2.5.5. (Equivalent statement of UGC) For any constant $\epsilon > 0$, there exists a large enough $q$ such that Γ-Max 2-Lin$_q$ $(1-\epsilon, \epsilon)$ is NP-hard.

If we want to parameterize the soundness by the size of the alphabet, the following statement is also equivalent to the UGC [99].

Conjecture  2.5.6.  For  any  constant  e,  there  exists  large  enough  q,  such  that  T-Max  2- 
LINg  (1  -  e,  1/q  2^ )  is  NP-hard. 
It should be noted that when a Γ-Max 2-Lin$_q$ instance has optimum value 1, it can be easily solved by Gaussian elimination. This is also true for Unique-Games. In comparison, Label-Cover is NP-hard to $\epsilon$-approximate even when it has optimum value 1. Khot's d-to-1 Conjecture addresses this difference by assuming that when $d \ge 2$, d-to-1 Label-Cover is also hard on satisfiable instances.
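To see why satisfiable instances of this form are easy, note that each equation $v_i - v_j = a$ determines one variable from the other. The sketch below uses simple propagation over connected components (which is what elimination amounts to for equations of this shape); the function name and input encoding are our own.

```python
from collections import defaultdict, deque


def solve_satisfiable_gamma_2lin(n, q, equations):
    """Find labels with v_i - v_j = a (mod q) for all (i, j, a), or None.

    Fixes one variable per connected component to 0 and propagates; if some
    equation is then violated, the instance is not satisfiable.
    """
    adj = defaultdict(list)
    for i, j, a in equations:
        adj[i].append((j, -a))  # v_j = v_i - a
        adj[j].append((i, a))   # v_i = v_j + a
    labels = [None] * n
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w, shift in adj[u]:
                if labels[w] is None:
                    labels[w] = (labels[u] + shift) % q
                    queue.append(w)
    ok = all((labels[i] - labels[j]) % q == a % q for i, j, a in equations)
    return labels if ok else None


print(solve_satisfiable_gamma_2lin(3, 5, [(0, 1, 2), (1, 2, 4), (0, 2, 1)]))
# [0, 3, 4]
print(solve_satisfiable_gamma_2lin(3, 5, [(0, 1, 2), (1, 2, 4), (0, 2, 0)]))
# None
```

Fixing one variable per component loses nothing, because adding the same constant to every variable of a component preserves all differences. Once even a small fraction of the equations is corrupted, propagation no longer helps, which is the regime the UGC is about.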

Conjecture 2.5.7. (d-to-1 Conjecture) For every constant $\epsilon > 0$ there is some constant $k(\epsilon)$ such that for d-to-1 Label-Cover instances $\mathcal{L}$ with $R_2 > k(\epsilon)$, d-to-1 Label-Cover $(1, \epsilon)$ is NP-hard.

2.5.1 UGC vs. d-to-1 Conjecture

Unique-Games does not have perfect completeness; i.e., it is easy when Opt = 1. As a result, with current reduction machinery none of the UGC-based hardness results applies to satisfiable Max-Φ problems, i.e., to $(1,s)$-approximability. In comparison, the d-to-1 Conjecture states that it is NP-hard to distinguish whether a d-to-1 Label-Cover instance is satisfiable or far from satisfiable; it can easily be adapted to reductions that address the approximability of satisfiable instances. The first application of the d-to-1 Conjecture is by Dinur et al. [43], who use a variant of the 2-to-1 Conjecture to obtain hardness of approximation results for the 4-Coloring problem. The reason they cannot use the UGC is that, assuming the UGC, they can only obtain hardness results that apply to "almost 4-colorable" graphs. There have also been several other works that use the d-to-1 Conjecture to derive hardness for satisfiable instances [71, 120, 137].

Assuming the d-to-1 Conjecture, we present a $(1, 5/8 + o(1))$ hardness result for 3-CSPs in Chapter 5 of this thesis. In addition, one may wonder whether it is possible to use SDPs to solve satisfiable d-to-1 Label-Cover and thereby disprove the d-to-1 Conjecture. We make some partial progress on understanding the SDP gap of d-to-1 Label-Cover in Chapter 6.


2.6  Dictator  Testing 

In this section, we introduce a gadget called "Dictator Testing", which is strongly motivated by its applications to proving hardness-of-approximation results for CSPs and learning. Generally speaking, we have black-box query access to an unknown Boolean function $f : \{0,1\}^n \to \{0,1\}$, and the goal is to test the extent to which $f$ is close to a "dictator" function, i.e., one of the $n$ functions of the form

$$f(x_1, \dots, x_n) = x_i.$$

Dictator Testing takes a somewhat different form in learning and in CSP applications, and we discuss the difference in the following two sections.

2.6.1  Dictator  Testing  for  CSPs 

A "test" is a randomized algorithm which makes a very small number of queries to $f$ and then either "accepts" or "rejects". The Dictator Testing problem was first studied by Bellare, Goldreich, and Sudan [19], with hardness-of-approximation for CSPs as the motivation. It was later independently introduced, with the "dictator" terminology, by Parnas, Ron, and Samorodnitsky [122].

Definition  2.6.1.  A  Dictator  Test  has  completeness  at  least  c  if  all  n  dictator  functions  are 
accepted  with  probability  at  least  c.  We  say  a  Dictator  Test  has  perfect  completeness  if  it 
has  completeness  1. 

The Dictator Test should also have the property of rejecting functions far from dictators with high probability, which is called the soundness of the test. Håstad [75, 76] introduced a notion of "quasirandom" function as one way of defining functions far from being dictators. One can think of these as functions $f$ which have correlation at most $o(1)$ with every "junta" (function depending on only $O(1)$ coordinates). Another way of thinking of these functions is that there is no procedure that outputs an $O(1)$-size list of coordinates of the function with the following property: when we permute the coordinates of the function, the coordinates in the list are permuted correspondingly. We refer to such tests as "Dictator-vs.-quasirandom Tests". As Håstad and others have demonstrated, Dictator-vs.-quasirandom Tests can often be used to prove optimal inapproximability results for CSPs.

Definition 2.6.2. (Informal.) A Dictator-vs.-quasirandom Test has soundness at most $s$ if every quasirandom function is accepted with probability at most $s + o(1)$.

Let us use the 3-CSP as an example. Suppose $\mathcal{T}$ is a 3-query Dictator-vs.-quasirandom Test on functions $f : \{0,1\}^n \to \{0,1\}$. Imagine we consider all possible random choices of $\mathcal{T}$, and in each case write down the (up to) 3 strings $x, y, z$ queried and the predicate applied to the outcomes to decide accept/reject. The complete behavior of $\mathcal{T}$ might then look like the following:

with probability $p_1$, accept iff $f(x^{(1)}) \vee f(y^{(1)}) \vee f(z^{(1)})$

with probability $p_2$, accept iff $\neg f(x^{(2)}) \vee f(y^{(2)})$

with probability $p_3$, accept iff $f(x^{(3)}) \vee \neg f(y^{(3)}) \vee \neg f(z^{(3)})$


This is precisely an instance of Max-3CSP, in which the "variables" are the $f(x)$'s. Note that the weights $p_i$ indeed sum up to 1. More generally, if $\mathcal{T}$ makes at most $k$ queries, it can be viewed as an instance of Max-kCSP. Further, suppose that $\mathcal{T}$ "uses the predicate set Φ", i.e., its decision to accept/reject is always based on applying a predicate from the set Φ to its query responses. Then $\mathcal{T}$ can be viewed as an instance of Max-Φ. The above example illustrates a tester which uses the set of ORs of up to 3 literals; thus it can be viewed as an instance of Max-3Sat.
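The view of a test as a weighted CSP over the values of $f$ can be illustrated by computing an acceptance probability exactly. The test below is a standard noisy three-query linearity test that we chose for illustration; it is not a test constructed in the thesis, it uses a XOR predicate instead of the ORs in the example above, and the parameters n = 3 and eps = 0.1 are arbitrary.

```python
from itertools import product


def acceptance_probability(f, n, eps):
    """Exact acceptance probability of a noisy 3-query linearity test.

    The test picks x, y uniformly, sets z = x XOR y XOR noise (each noise
    bit is 1 with probability eps), and accepts iff f(x)^f(y)^f(z) == 0.
    Each (x, y, noise) triple is one weighted constraint on the values of f.
    """
    cube = list(product([0, 1], repeat=n))
    total = 0.0
    for x in cube:
        for y in cube:
            for noise in cube:
                weight = 1.0 / len(cube) ** 2
                for bit in noise:
                    weight *= eps if bit else 1.0 - eps
                z = tuple(a ^ b ^ c for a, b, c in zip(x, y, noise))
                if (f(x) ^ f(y) ^ f(z)) == 0:
                    total += weight
    return total


n, eps = 3, 0.1
dictator = lambda x: x[0]
full_parity = lambda x: x[0] ^ x[1] ^ x[2]
print(acceptance_probability(dictator, n, eps))     # about 0.9 = 1 - eps
print(acceptance_probability(full_parity, n, eps))  # about 0.756
```

A dictator is accepted exactly when its own noise bit is 0, so its value in the associated CSP is 1 − eps, while the parity of all three coordinates is accepted less often. This sketch only illustrates completeness and the test-as-CSP view; it says nothing about soundness (for instance, the constant-zero function passes this particular test with probability 1).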

Suppose that $\mathcal{T}$ is a Dictator Test with completeness at least $c$. Then the Opt of the associated CSP is at least $c$; indeed, there are $n$ distinct solutions, the dictators, of value at least $c$. More crucially, suppose further that $\mathcal{T}$ is a Dictator-vs.-quasirandom Test with soundness at most $s$. This means that any solution $f$ satisfying slightly more than weight $s$ of the constraints must be slightly correlated with a junta on a constant number of coordinates; i.e., it must "highlight" a small number of dictators. These two properties of the test, taken together, make it useful as a gadget in an NP-hardness-of-approximation reduction. Specifically, if $\mathcal{T}$ uses predicate set Φ, it can be used to prove hardness for the Max-Φ problem. Indeed, in the study of inapproximability, one has the following "Rule of Thumb":


Rule of Thumb for CSPs. For the Max-$\Phi$ problem, to prove that Max-$\Phi$ (c,s) is hard,
construct a Dictator-vs.-quasirandom Test using $\Phi$, with completeness c and soundness s.
We call the pair (c,s) a dictator-vs.-quasirandom gap.

It is natural to ask: among all Dictator Tests with completeness c, how small can
the soundness s be?

Definition 2.6.3. We define the dictator-vs.-quasirandom gap curve by

$\mathrm{Gap}_{\mathrm{Test}}(c) = \inf\{s : (c,s) \text{ is a dictator-vs.-quasirandom gap}\}.$

2.6.2  Dictator  Test  for  Learning 

The dictator test is also very useful for learning problems, where it takes a somewhat different
form: we can make only one query to a Boolean function $f(x_1, \ldots, x_n)$; however, we can
assume that f belongs to the simple function class for which we want to prove hardness
results. Another difference is that the dictator test in learning usually checks whether two
functions are "matching dictators", as we will explain further.

For the sake of exposition of the usage of a dictator test, let us sketch a proof of the
hardness of HS-MA $(1-\epsilon, 1/2+\epsilon)$.

Proposition 2.6.4. Assuming the UGC, the problem HS-MA $(1-\epsilon, 1/2+\epsilon)$ is NP-hard.

As mentioned, the same hardness result (based on P ≠ NP) has been established
in [53, 68]. However, the following construction is different from (and somewhat simpler
than) the other proofs; it helps to illustrate the relationship between hardness of learning
and Dictator Tests.

Given an instance $\mathcal{L}$ of Unique-Games, we will produce a set of labelled examples
such that the following holds: if $\mathcal{L}$ is an almost satisfiable instance, then there is a halfspace
that agrees with a $1-\epsilon$ fraction of the examples, while if $\mathcal{L}$ is a nearly unsatisfiable instance
then no halfspace has agreement more than $1/2+\epsilon$. Clearly, a reduction of this nature
immediately implies Proposition 2.6.4.

Let $\mathcal{L}$ be an instance of Unique-Games with an associated graph G = (U, V, E) and a
set of labels [k]. The examples we generate will have (|U| + |V|)k coordinates, i.e., belong
to $\mathbb{R}^{(|U|+|V|)k}$. These coordinates are to be thought of as one block of k coordinates for every
vertex $w \in U \cup V$. We will index the coordinates of $x \in \mathbb{R}^{(|U|+|V|)k}$ as $x = (x_w^i)_{w \in U \cup V,\, i \in [k]}$.

Also, for any halfspace function $f : \mathbb{R}^{(|U|+|V|)k} \to \mathbb{R}$, we use the notation $f_w$ for the restriction
of f to a vertex $w \in U \cup V$ obtained by setting every coordinate $x_{w'}^i = 0$ when $w' \neq w$. Similarly,
for a particular edge $e \in E$, we denote by $f_e$ the restriction of f obtained by setting all $x_{w'}^i$ to
0 for $w' \notin e$.

For every labelling $\ell : U \cup V \to [k]$ of the instance, there is a corresponding halfspace
over $\mathbb{R}^{(|U|+|V|)k}$ given by

$\mathrm{sgn}\Big(\sum_{u \in U} x_u^{\ell(u)} - \sum_{v \in V} x_v^{\ell(v)}\Big).$

The idea is to construct a distribution of examples such that if there is a
good labelling for the Unique-Games instance, then the corresponding halfspace above has a
good agreement rate. Conversely, any halfspace with $1/2+\epsilon$ agreement should somehow imply
a labelling $\ell$ satisfying a constant fraction of the edges in $\mathcal{L}$.

Fix an edge e = (u,v). For the sake of exposition, let us assume $\pi_e$ is the identity
permutation, i.e., $\pi_e(i) = i$ for every $i \in [k]$. For each edge e, we require a set of examples $\mathcal{D}_e$ with the
following properties:

• All coordinates $x_w^i$ for a vertex $w \notin e$ are fixed to be zero.

• For any label $i \in [k]$, $\mathrm{sgn}(x_u^i - x_v^i)$ has agreement $1-\epsilon$ with the examples $\mathcal{D}_e$.

• If f has agreement $1/2+\epsilon$ on the set of examples $\mathcal{D}_e$, then there exists a labelling
strategy $L_f$ for each $w \in U \cup V$, based solely on $f_w$, such that $L_f$ satisfies the edge e
with non-negligible probability.

As the distribution of $\mathcal{D}_e$ looks only at the restriction of f to the edge e, which can be viewed
as a halfspace $\mathbb{R}^{2k} \to \mathbb{R}$, we can rephrase the above requirement as a pure property testing
problem. Given a halfspace function $f_e : \mathbb{R}^{2k} \to \mathbb{R}$, we need a randomized procedure
for generating examples that has the following property:

The  procedure  must  satisfy: 

• (Completeness) If $f_e(x) = x_u^i - x_v^i$ then the test accepts with probability $1-\epsilon$.

• (Soundness) If the test accepts with probability $1/2+\epsilon$, then we can output a coordinate
of $f_u$ and a coordinate of $f_v$ such that they match each other with non-negligible
probability.

As we can see, the above test does not only check whether the functions $f_u, f_v$ are dictators;
in addition, it accepts only when they are dictators with matching coordinates.

We claim that the following test serves this goal:


Matching Dictator Test $\mathcal{T}$

Choose $\epsilon$ to be small and $\delta$ to be "extremely small".

1. Generate independent $\epsilon$-biased bits $a_1, a_2, \ldots, a_k \in \{0,1\}$ (i.e., $a_i = 1$ with probability
$\epsilon$ and 0 with probability $1-\epsilon$).

2. Generate 2k independent unit Gaussian random variables:

$h_1, h_2, \ldots, h_k, g_1, g_2, \ldots, g_k.$

3. Generate a uniformly random bit $b \in \{-1, 1\}$.

4. Set $r = (a_1 h_1 + g_1, \ldots, a_k h_k + g_k, g_1, \ldots, g_k)$ and $u = (1, 1, \ldots, 1, 0, \ldots, 0) \in \mathbb{R}^{2k}$.

5. Set $y = r + b\delta u$.

6. Accept if $\mathrm{sgn}(f_e(y)) = b$.

Suppose that $f_e(x) = \theta + \sum_{i=1}^{k} (w_u^i x_u^i + w_v^i x_v^i)$. Also, without loss of generality, assume
that $\sum_{i=1}^{k} w_u^i = 1$. Then we know $f_e(y) = f_e(r) + b\delta$. Essentially, the test checks the following two
inequalities with equal probability:

• $f_e(r) < \delta$

• $f_e(r) > -\delta$

Since at least one of the above two inequalities will hold, the passing probability of $f_e$ is
$\frac{1}{2} + \frac{1}{2}\Pr\big(f_e(r) \in (-\delta, \delta)\big)$.

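As $\delta$ is extremely small, roughly we can think of the passing probability of $f_e$ as
$\frac{1}{2} + \frac{1}{2}\Pr(f_e(r) = 0)$. On the completeness side, it is easy to check that for $f_e = x_u^i - x_v^i$, $f_e(r) =
a_i h_i$ and $\Pr(f_e(r) = 0) = 1-\epsilon$. Overall, it passes the test with probability $\frac{1}{2} + \frac{1}{2}(1-\epsilon) = 1 - \epsilon/2$, which is at least $1-\epsilon$.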

On the soundness side, as $f_e(r) = \theta + \sum_{i=1}^{k} (w_u^i + w_v^i) g_i + \sum_{i=1}^{k} w_u^i a_i h_i$, to make $\Pr(f_e(r) = 0)$
non-negligible, we must have $w_u^i + w_v^i = 0$ for each i. Also there must be very "few"
nonzero $w_u^i$, as otherwise $\sum_i w_u^i a_i h_i$ will not vanish. Then a good labelling strategy would
be to randomly output a coordinate with nonzero weight in $f_u$ and in $f_v$. As there are very few
such coordinates, we know that with non-negligible probability they will match.
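
The completeness and soundness behaviour described above can be probed with a small Monte Carlo simulation. The following Python sketch is illustrative only: the function name `pass_rate`, the parameter values, and the reading of the constant term as a threshold `theta` are assumptions, not part of the thesis. It runs steps 1 to 6 of the Matching Dictator Test on a halfspace with weights `w_u`, `w_v` and estimates the acceptance probability.

```python
import random

def pass_rate(w_u, w_v, theta, eps, delta, trials, seed=0):
    """Estimate Pr[test accepts f_e], where
    f_e(x) = theta + sum_i (w_u[i] * x_u[i] + w_v[i] * x_v[i])."""
    rng = random.Random(seed)
    k = len(w_u)
    accepted = 0
    for _ in range(trials):
        a = [1 if rng.random() < eps else 0 for _ in range(k)]  # eps-biased bits
        h = [rng.gauss(0.0, 1.0) for _ in range(k)]
        g = [rng.gauss(0.0, 1.0) for _ in range(k)]
        b = rng.choice((-1, 1))
        # y = r + b*delta*u; the shift b*delta only touches the u-block.
        y_u = [a[i] * h[i] + g[i] + b * delta for i in range(k)]
        y_v = g
        value = theta + sum(w_u[i] * y_u[i] + w_v[i] * y_v[i] for i in range(k))
        if (value > 0) == (b == 1):  # accept iff sgn(f_e(y)) = b
            accepted += 1
    return accepted / trials

# Matching dictators x_u^1 - x_v^1 versus mismatched dictators x_u^1 - x_v^2.
matching = pass_rate([1, 0, 0], [-1, 0, 0], 0.0, eps=0.1, delta=1e-6, trials=20000)
mismatched = pass_rate([1, 0, 0], [0, -1, 0], 0.0, eps=0.1, delta=1e-6, trials=20000)
```

With these (assumed) parameters the matching pair is accepted with probability close to $1 - \epsilon/2$, while the mismatched pair is accepted with probability close to 1/2, in line with the analysis.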

Generally speaking, the following is the rule of thumb for proving hardness of learning
results:

Rule of Thumb for Learning. To prove that $\mathcal{H}$-MA (c,s) is hard, construct a one-query
matching Dictator Test such that dictator functions in $\mathcal{H}$ pass with probability at least c
while non-dictator functions in $\mathcal{H}$ pass with probability at most s.
Chapter 3

Mathematical Tools

In this chapter, we summarize the mathematical tools that are used throughout the
thesis.

3.1  Probability  Theory 

3.1.1  Product  Space 

The usual way of defining a probability space is as a triple: a sample space $\Omega$, a $\sigma$-algebra,
and a probability measure. In this thesis, as we mostly study probability spaces
in which $\Omega$ is a finite set (with the exception of the Gaussian distribution), we denote a
probability space by a pair $(\Omega, \mu)$ where $\Omega$ is the sample space and $\mu : \Omega \to (0, 1]$ is the
density function.

Definition 3.1.1. (finite probability space) Let $\Omega$ be a finite set of events $\{e_1, \ldots, e_q\}$. We
denote by $(\Omega, \mu)$ a probability space where $\mu : \Omega \to (0, 1]$ is the probability measure on $\Omega$
such that $\sum_{j=1}^{q} \mu(e_j) = 1$. The minimum atom probability of $\Omega$ is defined to be $\min_{j \in [q]} \mu(e_j)$.

Definition 3.1.2. (inner product) Given a finite probability space $(\Omega, \mu)$ with $|\Omega| = q$, we
know that the function space $\mathcal{F} = \{f \mid f : \Omega \to \mathbb{R}\}$ is a q-dimensional vector space; we define
the inner product induced by the probability measure $\mu$ as follows: for any $f, g : \Omega \to \mathbb{R}$,

$\langle f, g \rangle = \mathbb{E}_{e \sim (\Omega,\mu)}[f(e) \cdot g(e)].$

Definition 3.1.3. For all $f : \Omega \to \mathbb{R}$, we define its p-norm as

$\|f\|_p = \big(\mathbb{E}_{e \sim (\Omega,\mu)}[|f(e)|^p]\big)^{1/p}.$

Definition 3.1.4. (Ensemble) Given a finite probability space $(\Omega, \mu)$, suppose that $|\Omega| = q$.
We call the collection of functions $\chi_0, \ldots, \chi_{q-1}$ an ensemble if they form a basis for $\mathcal{F}$.

Further, we call an ensemble an orthogonal ensemble if the ensemble forms an orthogonal
basis and $\chi_0$ is the constant 1 function; i.e., $\chi_0(e) = 1$ for any $e \in \Omega$.

To characterize a finite probability space, we can either use $(\Omega, \mu)$ or an orthogonal
ensemble $(\chi_0 = 1, \ldots, \chi_{q-1})$ on it.

Next,  we  introduce  the  definition  of  the  product  of  probability  spaces: 

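Definition 3.1.5. (Product Space) For probability spaces $(\Omega_1, \mu_1), (\Omega_2, \mu_2), \ldots, (\Omega_n, \mu_n)$, we
define their product probability space $(\Omega, \mu) = \prod_{i=1}^{n} (\Omega_i, \mu_i)$ as follows: the sample space
is $\Omega = \prod_{i=1}^{n} \Omega_i = \{(e_1, \ldots, e_n) \mid e_i \in \Omega_i \text{ for } i \in [n]\}$ and the probability measure $\mu$ on any event
$(e_1, \ldots, e_n) \in \prod_{i=1}^{n} \Omega_i$ is defined to be $\prod_{i=1}^{n} \mu_i(e_i)$.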

For simplicity we assume that each $\Omega_i$ has the same cardinality q. Also, for each
function space $\mathcal{F}_i = \{f \mid f : \Omega_i \to \mathbb{R}\}$, we denote an orthogonal ensemble on it by $\{\chi_{i,0} =
1, \chi_{i,1}, \ldots, \chi_{i,q-1}\}$. By basic linear algebra, the function space

$\mathcal{F} = \{f \mid f : \prod_{i=1}^{n} \Omega_i \to \mathbb{R}\}$

has an orthogonal basis $\{\chi_\sigma : \prod_{i=1}^{n} \Omega_i \to \mathbb{R} \mid \sigma \in [q]^n\}$ with each $\chi_\sigma$ defined as follows: for
$x \in \prod_{i=1}^{n} \Omega_i$,

$\chi_\sigma(x) = \prod_{i=1}^{n} \chi_{i,\sigma_i}(x_i).$

Therefore any function $f \in \mathcal{F}$ can be written as a linear combination of the basis:

$f(x) = \sum_{\sigma \in [q]^n} \hat{f}(\sigma) \chi_\sigma(x).$

We call $\hat{f}(\sigma)$ the Fourier coefficient of f on $\sigma$, where $\sigma \in [q]^n$ is also referred to as the
multidimensional index.

As each $\chi_{i,0}$ is the constant 1 function, we can ignore them in the expression of $\chi_\sigma$; i.e.,
write $\chi_\sigma(x)$ as $\prod_{i : \sigma_i \neq 0} \chi_{i,\sigma_i}(x_i)$. We therefore define the active elements for any $\sigma \in [q]^n$ to be

$S(\sigma) = \{i \mid \sigma_i \neq 0\}.$

For any $\sigma \in [q]^n$, we define $\deg(\chi_\sigma)$, the degree of the term $\chi_\sigma$, to be the number of active
elements $|S(\sigma)|$.

Any function in $\mathcal{F}$ can be viewed as a multilinear¹ polynomial in the functions $\{\chi_{i,j} \mid i \in
[n], 1 \le j < q\}$. The degree of a function is then defined as the maximum degree among all
of its terms with nonzero Fourier coefficients.

Definition 3.1.6. (Degree) For a function $f = \sum_{\sigma \in [q]^n} \hat{f}(\sigma) \chi_\sigma(x)$, we define its degree as follows:

$\deg(f) = \max\{\deg(\chi_\sigma) \mid \hat{f}(\sigma) \neq 0, \sigma \in [q]^n\}.$

The following is a relationship between the Fourier representation and the variance of a function:

Fact 3.1.7.

$\mathrm{Var}(f) = \sum_{\sigma : |S(\sigma)| \ge 1} \hat{f}(\sigma)^2.$
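
As a sanity check of Fact 3.1.7, the following Python sketch computes Fourier coefficients in the simplest special case: the uniform measure on $\{-1,1\}^n$ (so q = 2) with $\chi_{i,1}(x_i) = x_i$, for which the basis is orthonormal. The choice of this special case, the example function (majority on 3 bits), and all names are assumptions made for illustration.

```python
from itertools import product

def fourier_coefficients(f, n):
    """Coefficients of f : {-1,1}^n -> R in the basis
    chi_sigma(x) = prod_{i : sigma_i = 1} x_i, under the uniform measure."""
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for sigma in product((0, 1), repeat=n):
        total = 0.0
        for x in points:
            chi = 1
            for i in range(n):
                if sigma[i]:
                    chi *= x[i]
            total += f(x) * chi
        coeffs[sigma] = total / len(points)  # <f, chi_sigma>
    return coeffs

def majority3(x):
    return 1 if sum(x) > 0 else -1

coeffs = fourier_coefficients(majority3, 3)
points = list(product((-1, 1), repeat=3))
mean = sum(majority3(x) for x in points) / len(points)
variance = sum((majority3(x) - mean) ** 2 for x in points) / len(points)
# Sum of squared coefficients over all sigma with at least one active element.
fourier_variance = sum(c * c for sigma, c in coeffs.items() if any(sigma))
```

For majority on 3 bits the degree-1 coefficients are 1/2 and the degree-3 coefficient is -1/2, so both ways of computing the variance give 1.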

For any $S \subseteq [n]$, if we take $f_S$ to be $\sum_{\sigma : S(\sigma) = S} \hat{f}(\sigma) \chi_\sigma$ and write f as the sum of the $f_S$, we get
the Efron-Stein Decomposition.

Theorem 3.1.8. (Efron-Stein Decomposition [45]) Let $(\Omega_1, \mu_1), \ldots, (\Omega_n, \mu_n)$ be discrete probability
spaces. Then for $f : \prod_{i=1}^{n} \Omega_i \to \mathbb{R}$, if we write f as

$f(x) = \sum_{S \subseteq [n]} f_S(x),$

we call the above representation the Efron-Stein Decomposition of f. Such a decomposition has
the following properties:

• $f_S(x)$ depends only on the variables in $x_S = (x_i)_{i \in S}$.

• For all $S \not\subseteq S'$ and $a_{S'} \in \prod_{i \in S'} \Omega_i$, it holds that $\mathbb{E}[f_S(x) \mid x_{S'} = a_{S'}] = 0$.


3.1.2  Influence,  Noise  and  Stability 

Given a function $f : \prod_{i=1}^{n} \Omega_i \to \mathbb{R}$ on a probability space $(\Omega, \mu) = \prod_{i=1}^{n} (\Omega_i, \mu_i)$, we can define its
influence on the i-th input as follows:

¹ By multilinear, we mean that the power of each $\chi_{i,j}$ in each term is at most 1.

Definition 3.1.9. (Influence) The influence of the i-th coordinate is defined to be

$\mathrm{Inf}_i(f) = \mathbb{E}_{x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n}\big[\mathrm{Var}_{x_i}[f(x)]\big].$

The influence of f on the i-th coordinate is the average variance of f over the possible
configurations of the other coordinates.

The influence of a function can be represented in terms of the Fourier coefficients:

Fact 3.1.10.

$\mathrm{Inf}_i(f) = \sum_{\sigma : i \in S(\sigma)} \hat{f}(\sigma)^2.$

We can also generalize the above notion to define the low-degree influence of a function as
follows:

Definition 3.1.11. For any integer d, $\mathrm{Inf}_i^{\le d}(f) = \sum_{\sigma : i \in S(\sigma),\, |S(\sigma)| \le d} \hat{f}(\sigma)^2$.

The sum of all the low-degree influences is bounded by d times the variance.

Fact 3.1.12.

$\sum_{i=1}^{n} \mathrm{Inf}_i^{\le d}(f) \le d \cdot \mathrm{Var}(f).$
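
The three statements above can be checked numerically on a small example. The sketch below again assumes the uniform measure on $\{-1,1\}^n$ with characters $\prod_{i} x_i$ (an illustrative special case, with hypothetical function names): it computes the influence once directly from Definition 3.1.9 and once from the Fourier formula, and also evaluates the degree-1 influences.

```python
from itertools import product

def coefficient(f, n, sigma):
    """Fourier coefficient of f on sigma under the uniform measure on {-1,1}^n."""
    points = list(product((-1, 1), repeat=n))
    total = 0.0
    for x in points:
        chi = 1
        for i in range(n):
            if sigma[i]:
                chi *= x[i]
        total += f(x) * chi
    return total / len(points)

def influence_by_definition(f, n, i):
    """Average, over the other coordinates, of the variance of f in x_i."""
    others = list(product((-1, 1), repeat=n - 1))
    total = 0.0
    for rest in others:
        vals = [f(rest[:i] + (xi,) + rest[i:]) for xi in (-1, 1)]
        mean = sum(vals) / 2
        total += sum((v - mean) ** 2 for v in vals) / 2
    return total / len(others)

def low_degree_influence(f, n, i, d):
    """Sum of squared coefficients over sigma with i active and at most d active elements."""
    return sum(
        coefficient(f, n, sigma) ** 2
        for sigma in product((0, 1), repeat=n)
        if sigma[i] and sum(sigma) <= d
    )

def majority3(x):
    return 1 if sum(x) > 0 else -1

n = 3
by_definition = [influence_by_definition(majority3, n, i) for i in range(n)]
by_fourier = [low_degree_influence(majority3, n, i, n) for i in range(n)]  # d = n: full influence
degree_one = [low_degree_influence(majority3, n, i, 1) for i in range(n)]
```

For majority on 3 bits, each coordinate has influence 1/2, each degree-1 influence is 1/4, and the degree-1 influences sum to 3/4, below the bound $1 \cdot \mathrm{Var}(f) = 1$ of Fact 3.1.12.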

Next we define an important concept called the Noise Operator.

Definition 3.1.13. For a probability space $(\Omega, \mu) = \prod_{i=1}^{n} (\Omega_i, \mu_i)$ and $0 \le \rho \le 1$, we define the noise
operator $T_\rho$ on functions $f : \Omega \to \mathbb{R}$ as follows:

$T_\rho f(x) = \mathbb{E}[f(x')],$

where x' has the following distribution: independently, each $x'_i$ is set to be $x_i$ with probability
$\rho$ and sampled from $(\Omega_i, \mu_i)$ with probability $1 - \rho$. We also define the noise stability

$S_\rho(f) = \mathbb{E}_{x, x'}[f(x) f(x')] = \mathbb{E}_x[f(x) T_\rho f(x)].$

We have the following facts.

Proposition 3.1.14.

$T_\rho f = \sum_{\sigma} \rho^{|S(\sigma)|} \hat{f}(\sigma) \chi_\sigma.$

Proposition 3.1.15.

$S_\rho(f) = \sum_{\sigma} \rho^{|S(\sigma)|} \hat{f}(\sigma)^2.$
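
Proposition 3.1.15 can be verified by brute force on a small cube. The following Python sketch assumes the uniform measure on $\{-1,1\}^n$ (so resampling a coordinate keeps its value with probability 1/2); the function names and the example are illustrative assumptions. It computes the noise stability once directly from the joint distribution of (x, x') and once from the Fourier formula.

```python
from itertools import product
from math import prod

def noise_stability_direct(f, n, rho):
    """E[f(x) f(x')] with x uniform and x'_i = x_i w.p. rho, else a fresh uniform bit."""
    keep = (1 + rho) / 2  # Pr[x'_i = x_i]: kept outright, or resampled to the same value
    points = list(product((-1, 1), repeat=n))
    total = 0.0
    for x in points:
        for y in points:
            agree = sum(1 for a, b in zip(x, y) if a == b)
            total += f(x) * f(y) * keep ** agree * (1 - keep) ** (n - agree)
    return total / len(points)

def noise_stability_fourier(f, n, rho):
    """Sum over sigma of rho^{|S(sigma)|} * fhat(sigma)^2."""
    points = list(product((-1, 1), repeat=n))
    total = 0.0
    for sigma in product((0, 1), repeat=n):
        coeff = sum(
            f(x) * prod(x[i] for i in range(n) if sigma[i]) for x in points
        ) / len(points)
        total += rho ** sum(sigma) * coeff ** 2
    return total

def majority3(x):
    return 1 if sum(x) > 0 else -1

direct = noise_stability_direct(majority3, 3, 0.5)
fourier = noise_stability_fourier(majority3, 3, 0.5)
```

For majority on 3 bits and $\rho = 1/2$, both computations give $\frac{3}{4}\rho + \frac{1}{4}\rho^3 = 0.40625$.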

3.2 Advanced Probability Machineries

In the following, we introduce two tools for analyzing functions on product probability
spaces.

3.2.1 Invariance Principle

The invariance principle [116] characterizes the asymptotic behaviour of low-influence
functions over product distributions.

There are multiple versions of the invariance principle; we state the version using noisy
influences rather than low-degree influences (for a sketch of the proof, one can look
at [126]).

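Theorem 3.2.1. Consider a probability space $(\Omega, \mu) = \prod_{i=1}^{n} (\Omega_i, \mu_i)$ where each $(\Omega_i, \mu_i)$ has an
ensemble $\mathcal{X}_i = (\chi_{i,0} = 1, \ldots, \chi_{i,q-1})$. Let $\mathcal{G}_i = (g_{i,0} = 1, \ldots, g_{i,q-1})$ follow the multivariate Gaussian
distribution with covariance matrix specified by the following "matching moments"
condition: for any $i \in [n]$ and $j_1, j_2 \in [q]$,

$\mathbb{E}[\chi_{i,j_1} \chi_{i,j_2}] = \mathbb{E}[g_{i,j_1} g_{i,j_2}], \qquad (3.1)$

and $\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_n$ are all independent of each other.

Also assume the minimum atom probability among all $(\Omega_i, \mu_i)$ is at least $\alpha$. Let $f(\mathcal{X}_1, \ldots, \mathcal{X}_n)$ be
a real-valued function and assume that $\max_i \mathrm{Inf}_i(T_{1-\epsilon} f) \le \tau$. Then for any function $\psi : \mathbb{R} \to \mathbb{R}$ with
bounded third derivative $|\psi'''(t)| \le B$,

$\big|\mathbb{E}[\psi(T_{1-\epsilon} f(\mathcal{X}_1, \ldots, \mathcal{X}_n))] - \mathbb{E}[\psi(T_{1-\epsilon} f(\mathcal{G}_1, \ldots, \mathcal{G}_n))]\big| \le o_\tau(1).$

Here $\epsilon, \alpha, B$ are all constants independent of n.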

The above invariance principle applies to functions defined on a single product probability
space; later in [115] (also see [43]), Mossel generalized the above result to vector-valued functions
and products of functions on correlated probability spaces. To state his results, first
let us define the correlation between two probability spaces.

Definition 3.2.2. Let $(\Omega \times \Theta, \mu)$ be a finite probability space. Define the correlation between
$\Omega$ and $\Theta$ to be:

$\rho(\Omega, \Theta; \mu) = \sup\{\mathrm{Cov}[f, g] : f : \Omega \to \mathbb{R},\ g : \Theta \to \mathbb{R},\ \mathrm{Var}[f] = \mathrm{Var}[g] = 1\}.$

The conditional operator $U_\mu$ associated with $\mu$ is a mapping from the function space $\{f \mid f : \Theta \to
\mathbb{R}\}$ to $\{g \mid g : \Omega \to \mathbb{R}\}$ defined as follows: for $f : \Theta \to \mathbb{R}$, any $x_0 \in \Omega$, and a random variable
pair $x \in \Omega, y \in \Theta$ drawn from $\mu$, $U_\mu f(x_0) = \mathbb{E}_y[f(y) \mid x = x_0]$.

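We also define the following quantities of the Gaussian distribution.

Definition 3.2.3. Let $\Phi(x)$ be the CDF of the one-dimensional Gaussian distribution, and let
$g_1$ and $g_2$ be bivariate Gaussian random variables with mean zero and covariance matrix
$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$. For $\rho \in [-1, 1]$, we define $\underline{\Gamma}_\rho, \overline{\Gamma}_\rho : [0,1]^2 \to [0,1]$ by

$\overline{\Gamma}_\rho(\delta_1, \delta_2) = \Pr\big(g_1 \le \Phi^{-1}(\delta_1) \wedge g_2 \le \Phi^{-1}(\delta_2)\big);$
$\underline{\Gamma}_\rho(\delta_1, \delta_2) = \Pr\big(g_1 \le \Phi^{-1}(\delta_1) \wedge g_2 \ge \Phi^{-1}(1 - \delta_2)\big).$

When $\delta_1 = \delta_2 = \delta$, we simplify the above notations to $\underline{\Gamma}_\rho(\delta)$ and $\overline{\Gamma}_\rho(\delta)$.

The following theorem is a generalization of the invariance principle to functions on correlated
probability spaces [115]: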
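
Theorem 3.2.4. Let $\{(\Omega_i \times \Theta_i, \mu_i)\}_{i=1}^{n}$ be a collection of correlated probability spaces and
assume that $\rho(\Omega_i, \Theta_i; \mu_i) \le \rho_0$ for each $i \in [n]$. The probability space $(\Omega \times \Theta, \mu)$ is defined
to be $\prod_{i=1}^{n} (\Omega_i \times \Theta_i, \mu_i)$, and the functions $f : \Omega \to \mathbb{R}$ and $g : \Theta \to \mathbb{R}$ have the property that $\mathbb{E}[f] =
\delta_1$ and $\mathbb{E}[g] = \delta_2$. (Here the expectations are taken with respect to the marginal distributions
of $\mu$ on $\Omega$ and $\Theta$.) Assume the minimum atom probability among all $(\Omega_i \times \Theta_i, \mu_i)$ for
$i \in [n]$ is at least $\alpha$. If f and g also satisfy the following influence property:

$\max_i \min\big(\mathrm{Inf}_i(T_{1-\epsilon} f), \mathrm{Inf}_i(T_{1-\epsilon} g)\big) \le \tau,$

then we have that

$\underline{\Gamma}_{\rho_0}(\delta_1, \delta_2) - o_{\tau,\alpha,\epsilon}(1) \le \mathbb{E}[f \cdot g] \le \overline{\Gamma}_{\rho_0}(\delta_1, \delta_2) + o_{\tau,\alpha,\epsilon}(1).$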
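
To get a feel for the two Gaussian quantities of Definition 3.2.3 that bound $\mathbb{E}[f \cdot g]$ above, the following Python sketch estimates them by Monte Carlo sampling. The function name, sample size, and seed are assumptions for illustration. For $\delta_1 = \delta_2 = 1/2$ the estimates can be compared against the classical closed form for Gaussian quadrant probabilities, $\frac{1}{4} \pm \frac{\arcsin \rho}{2\pi}$ (Sheppard's formula, a standard fact that is not stated in the thesis).

```python
import math
import random
from statistics import NormalDist

def gamma_estimates(rho, d1, d2, samples=100_000, seed=0):
    """Monte Carlo estimates (lower, upper) of the two quantities in Definition 3.2.3."""
    rng = random.Random(seed)
    inv = NormalDist().inv_cdf  # Phi^{-1}
    t1, t2_upper, t2_lower = inv(d1), inv(d2), inv(1 - d2)
    upper = lower = 0
    for _ in range(samples):
        g1 = rng.gauss(0.0, 1.0)
        # g2 is a standard Gaussian with correlation rho to g1.
        g2 = rho * g1 + math.sqrt(1 - rho * rho) * rng.gauss(0.0, 1.0)
        if g1 <= t1:
            upper += g2 <= t2_upper
            lower += g2 >= t2_lower
    return lower / samples, upper / samples

rho = 0.5
lower_est, upper_est = gamma_estimates(rho, 0.5, 0.5)
upper_exact = 0.25 + math.asin(rho) / (2 * math.pi)
lower_exact = 0.25 - math.asin(rho) / (2 * math.pi)
```

With $\rho = 1/2$ the closed forms are 1/3 and 1/6, and the sampled estimates should agree with them up to sampling error.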

3.2.2  Hypercontractivity 

Hypercontractivity provides us with another tool for analyzing functions defined on products of
probability spaces.

Definition 3.2.5. We say that a real random variable x is $(p, q, \eta)$-hypercontractive for
$1 \le p \le q < \infty$ and $0 < \eta < 1$ if $\|x\|_q < \infty$, and for all $a \in \mathbb{R}$, $\|a + \eta x\|_q \le \|a + x\|_p$.

For a discrete distribution, the following hypercontractivity is known:

Theorem 3.2.6. Let $(\Omega, \mu)$ be a finite probability space with minimum atom probability $\alpha$.
Then every function $f : \Omega \to \mathbb{R}$ with $\mathbb{E}[f] = 0$ is $(2, p, \eta_p(\alpha))$-hypercontractive with

$\eta_p(\alpha) = \left(\frac{A^{1/p} - A^{-1/p}}{A^{1/p'} - A^{-1/p'}}\right)^{1/2},$

where $A = \frac{1-\alpha}{\alpha}$ and $1/p + 1/p' = 1$.

As for continuous distributions, the following hypercontractivity theorem is known for the
Gaussian distribution:

Theorem 3.2.7. Let $\mathcal{G}$ be a one-dimensional Gaussian distribution; then $\mathcal{G}$ is $(2, q, 1/\sqrt{q-1})$-hypercontractive.

Now we state the hypercontractivity theorem for low-degree polynomials on a product
probability space.

Theorem 3.2.8. If a probability space $(\Omega, \mu)$ is $(2, p, \eta)$-hypercontractive, then a degree-d
polynomial $f : \Omega^n \to \mathbb{R}$ on the probability space $(\Omega^n, \mu^n)$ is $(2, p, \eta^d)$-hypercontractive.

In addition, the following "hypercontractive inequality" [23, 62] is known for functions
to which the noise operator is applied.

Theorem 3.2.9. Suppose $0 \le \rho \le 1$, $q \ge 2$ and $p > 1$ satisfy $\rho \le 1/\sqrt{(q-1)/(p-1)}$. Then for
all $f : \{-1, 1\}^n \to \mathbb{R}$, with respect to the uniform distribution on $\{-1, 1\}^n$,

$\|T_\rho f\|_q \le \|f\|_p.$

For a large domain such as $[q]^n$, the optimal bound was first proved by Diaconis and
Saloff-Coste [40]; the following uses their Theorems 3.5.ii and A.1 plus Hölder duality:

Theorem 3.2.10. Let $q > 2$, $f : [q]^n \to \mathbb{R}$, and $0 < \epsilon < 1$. Also assume the distribution is
the uniform distribution on $[q]^n$. Then

$\|T_{1-\epsilon} f\|_2 \le \|f\|_p$, where $p = p(q, \epsilon) = 1 + (1 - \epsilon)^{(2 - 4/q)/\log(q-1)}$.
Part II

CSPs and SDP

Chapter 4

Approximation Curve for Max Cut

4.1 Introduction


Max Cut is a Boolean CSP with the '≠' constraints. It is also equivalent to the following
graph problem. Given an undirected graph G = (V, E), the Max Cut problem asks
for a partition of the vertices into two sets so as to maximize the number of edges connecting
the two sets. It is one of the classic NP-hard problems from Karp's list of 21 [88] and
is arguably the simplest NP-hard problem. To cope with its NP-hardness and to understand
hard instances, there has been a variety of work on its approximation algorithms.
The greedy algorithm (or the random-assignment algorithm) is easily shown to have an
approximation ratio of 1/2 (see [129]). Goemans and Williamson [59] gave an SDP rounding
algorithm achieving a .878 approximation ratio.¹ Since the early '90s, there has been a large
amount of interest in the SDP relaxation, in approximation algorithms, and in hardness
of approximation for Max Cut [3, 4, 15, 30, 36, 37, 47, 48, 50, 59, 73, 86, 99, 102, 107, 142].
In this chapter, we build on the results in many of these papers and determine an essentially
complete picture of the optimal approximation algorithms, SDP gaps, Dictator Tests,
and UGC-hardness for Max Cut.


4.1.1  Definitions 

We begin with the basic definitions. We generally work with edge-weighted, undirected
graphs G = (V, E, w), where $w : E \to \mathbb{R}^{\ge 0}$ gives the nonnegative edge weights. The issue of
self-loops turns out to be a nuisance; our policy will be to disallow them unless otherwise
specified. Without loss of generality, we will always assume the edge weights sum to 1;
i.e., $\sum_{e \in E} w(e) = 1$. Thus we can think of the weights as giving a probability distribution on
edges; we will therefore omit w and think of E as a (symmetric) probability distribution
on edges, writing $(u,v) \sim E$ to denote a draw from this distribution.

Definition 4.1.1. A (proper) cut in G is a partition of the vertices into two parts, $h : V \to
\{-1, 1\}$. The value of the cut is

$\mathrm{val}_G(h) = \Pr_{(u,v) \sim E}[h(u) \neq h(v)] = \mathbb{E}_{(u,v) \sim E}\big[\tfrac{1}{2} - \tfrac{1}{2} h(u) h(v)\big].$

The Max Cut problem is the following: Given G, find a proper cut h with as large a value
as possible.

In general, we prefer the second definition of value given above, since it generalizes to
fractional cuts:

Definition 4.1.2. A fractional cut in G is a function $h : V \to [-1, 1]$. The value of the
fractional cut is

$\mathrm{val}_G(h) = \mathbb{E}_{(u,v) \sim E}\big[\tfrac{1}{2} - \tfrac{1}{2} h(u) h(v)\big].$

¹ The SDP relaxation itself was given earlier by Delorme and Poljak [37], who noted it was
polynomial-time computable.

Given a fractional cut h, we can randomly produce a proper cut h' by setting each value
h'(v) to be 1 with probability $\frac{1}{2} + \frac{1}{2} h(v)$ and -1 with probability $\frac{1}{2} - \frac{1}{2} h(v)$, independently
across v's. In this way, $\mathbb{E}[h'(v)] = h(v)$. It follows that $\mathbb{E}[\mathrm{val}_G(h')] = \mathrm{val}_G(h)$ (although this
uses the fact that G has no self-loops). Hence there always exists a proper cut h' with
value at least $\mathrm{val}_G(h)$, and furthermore such a cut can easily be found deterministically
from h using the method of conditional expectations. For these reasons, we will henceforth
treat the Max Cut problem as being about finding a fractional cut with as large a value
as possible, and we will refer to fractional cuts simply as 'cuts'.
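
The derandomization step mentioned above can be sketched in a few lines of Python. The graph encoding (a dictionary from vertex pairs to weights) and the function names are assumptions for illustration. The key point is that, without self-loops, $\mathrm{val}_G$ is affine in each h(v) separately, so replacing h(v) by the better of the two endpoints -1 and +1 never decreases the value.

```python
def cut_value(edges, h):
    """val_G(h) = E_{(u,v)~E}[1/2 - 1/2 h(u) h(v)]; `edges` maps (u, v) to a
    weight, and the weights are assumed to sum to 1."""
    return sum(w * (0.5 - 0.5 * h[u] * h[v]) for (u, v), w in edges.items())

def round_by_conditional_expectations(edges, h):
    """Turn a fractional cut into a proper cut of at least the same value."""
    h = dict(h)
    for v in h:
        # Fix h(v) to whichever of -1, +1 gives the larger value, keeping the
        # remaining (possibly still fractional) coordinates unchanged.
        h[v] = max((-1, 1), key=lambda s: cut_value(edges, {**h, v: s}))
    return h

# Example: a triangle with uniform edge weights, starting from the cut h = 0.
triangle = {("a", "b"): 1 / 3, ("b", "c"): 1 / 3, ("a", "c"): 1 / 3}
fractional = {"a": 0.0, "b": 0.0, "c": 0.0}
proper = round_by_conditional_expectations(triangle, fractional)
```

On the triangle, the fractional cut h = 0 has value 1/2, and the procedure returns a proper cut that cuts two of the three edges.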

Definition 4.1.3. The optimum cut value, or Max Cut, for G is denoted

$\mathrm{Opt}(G) = \sup_{h : V \to [-1,1]} \mathrm{val}_G(h).$

Note that the optimum is always at most 1 and at least 1/2 (since the fractional cut h = 0
is always available).

4.1.2  SDP  Gaps  of  Max  Cut 

All of the best approximation guarantees for Max Cut currently known are achieved by
algorithms using the SDP [37, 49, 59, 123]:

Definition 4.1.4. The (Max Cut) SDP value of a graph G is

$\mathrm{Sdp}(G) = \max_{g : V \to B_n} \mathbb{E}_{(u,v) \sim E}\big[\tfrac{1}{2} - \tfrac{1}{2} g(u) \cdot g(v)\big] \qquad (4.1)$

where n = |V| and $B_n$ denotes $\{x \in \mathbb{R}^n : \|x\| \le 1\}$. Note that $\mathrm{Sdp}(G) \ge \mathrm{Opt}(G)$, as g can always
be taken to map into [-1, 1].

We should note that for graphs without self-loops, it is easy to see that the optimal
embedding maps all vertices to the boundary of the ball.

Recall the following definition of SDP gap for Max Cut. Note that as there is a trivial
way of finding a cut of value at least 1/2, we only consider (c,s)-gaps for $c \ge s \ge 1/2$.

Definition 4.1.5. For $\frac{1}{2} \le s \le c \le 1$, we call the pair (c, s) an SDP gap if there exists a graph
G with $\mathrm{Sdp}(G) \ge c$ and $\mathrm{Opt}(G) \le s$. We define the SDP gap curve by

$\mathrm{Gap}_{\mathrm{SDP}}(c) = \inf\{s : (c,s) \text{ is an SDP gap}\}.$

Triangle inequalities. One can also consider strengthening the SDP by adding the
'triangle inequalities': i.e., enforcing

$g(v_1) \cdot g(v_2) - g(v_2) \cdot g(v_3) - g(v_1) \cdot g(v_3) \ge -1,$

$g(v_1) \cdot g(v_2) + g(v_2) \cdot g(v_3) + g(v_1) \cdot g(v_3) \ge -1,$

for all $v_1, v_2, v_3 \in V$. All of our positive results (rounding algorithms) will hold without the
triangle inequalities, and we focus attention in this work almost exclusively on the basic
SDP (4.1). However, we will also show that all of our negative results (SDP gaps, algorithmic
limitations) hold even with the triangle inequalities.

We analogously define the curve $\mathrm{Gap}_{\Delta\mathrm{SDP}}$ for the SDP with the triangle inequalities.
Of course, we have $\mathrm{Gap}_{\Delta\mathrm{SDP}}(c) \ge \mathrm{Gap}_{\mathrm{SDP}}(c)$ for all c.




4.1.3  RPR2  Algorithms 

The GW algorithm's approximation curve is as follows:

$\mathrm{Apx}_{\mathrm{GW}}(c) = \begin{cases} \frac{1}{\pi} \arccos(1 - 2c) & \text{if } c \ge .844, \\ .878\, c & \text{if } c \le .844. \end{cases}$

There have been several improvements that achieve a better approximation curve (particularly
for c < 0.844). Generalizing the GW algorithm, Feige and Langberg [48] introduced
the 'RPR2' (Randomized Projection, Randomized Rounding) framework for rounding the
solutions of SDP relaxations:

Definition 4.1.6. An RPR2 algorithm for Max Cut is defined by a rounding function,
$r : \mathbb{R} \to [-1, 1]$. Given a graph G, the steps of the algorithm are as follows:

1. Use SDP to find an optimal embedding $g : V \to S^{n-1}$ for the SDP (4.1).

2. Choose a random vector $Z \in \mathbb{R}^n$ according to the n-dimensional Gaussian distribution.

3. Output the (fractional) cut $h : V \to [-1, 1]$ defined by $h(v) = r(g(v) \cdot Z)$.

(Certain implementation details of the RPR2 method are discussed in Section 4.13.)

All of the known SDP algorithms for Max Cut fall into the RPR2 framework. For
example, the GW algorithm is RPR2 with rounding function r(x) = sgn(x); the random-assignment
algorithm is RPR2 with rounding function r(x) = 0. Zwick's algorithm [142] is
not obviously RPR2, but it is shown to be so by Feige and Langberg [48]. In that paper, the
authors suggest using 's-linear' rounding functions: i.e., functions of the form r(t) = t/s if
$-s \le t \le s$, r(t) = 1 if $t \ge s$, r(t) = -1 if $t \le -s$. Charikar and Wirth's analysis [30] for $c = \frac{1}{2} + \epsilon$
indeed uses RPR2 with s-linear rounding functions.
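
Steps 2 and 3 of the RPR2 framework, together with the three rounding functions just mentioned, can be sketched in Python as follows. The sketch does not solve the SDP; it assumes that an embedding of the vertices into unit vectors is already given, and all names (`rpr2_round`, `s_linear`, the toy graph) are hypothetical.

```python
import random

def sign_rounding(t):
    """GW rounding r(t) = sgn(t)."""
    return 1.0 if t >= 0 else -1.0

def zero_rounding(t):
    """Rounding function r(t) = 0 (yields the fractional cut h = 0)."""
    return 0.0

def s_linear(s):
    """s-linear rounding: t/s clipped to [-1, 1], for s > 0."""
    def r(t):
        return max(-1.0, min(1.0, t / s))
    return r

def rpr2_round(embedding, r, rng):
    """Project each g(v) onto a random Gaussian vector Z and apply r."""
    dim = len(next(iter(embedding.values())))
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return {v: r(sum(gi * zi for gi, zi in zip(g, z))) for v, g in embedding.items()}

def cut_value(edges, h):
    return sum(w * (0.5 - 0.5 * h[u] * h[v]) for (u, v), w in edges.items())

# Toy instance: a single edge whose endpoints are embedded as antipodal unit vectors.
edges = {("u", "v"): 1.0}
embedding = {"u": (1.0, 0.0), "v": (-1.0, 0.0)}
rng = random.Random(0)
value_sign = cut_value(edges, rpr2_round(embedding, sign_rounding, rng))
value_zero = cut_value(edges, rpr2_round(embedding, zero_rounding, rng))
value_linear = cut_value(edges, rpr2_round(embedding, s_linear(0.5), rng))
```

On this toy instance the sign rounding always separates the two endpoints, the zero rounding gives the trivial value 1/2, and the s-linear rounding lands somewhere in between depending on the projection.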

We conclude the discussion of RPR2 algorithms by mentioning that, given an input
graph G, it can be advantageous to try several different rounding functions r. It is well
known (as discussed in Section 4.13) that given a collection $\mathcal{R}$ of rounding functions,
one can achieve the performance of the best of them with a running time slowdown of only
$O(|\mathcal{R}| \log |\mathcal{R}|)$. Indeed, Feige and Langberg even suggested the idea of trying 'all' possible
rounding functions, up to some $\epsilon$-discretization. Whether or not this achieves the performance
of the 'optimal' rounding function up to an additive $\epsilon$ is a tricky issue which we
discuss further in Section 4.2.2.


4.1.4 Dictator Tests of ≠

For Max Cut one needs a dictator test making only 2 queries and testing $f(x) \neq f(y)$. The
rule of thumb is that giving such a test with 'completeness' c and 'soundness' s may allow
one to derive a c vs. s inapproximability result. (We give concrete theorems along these
lines later in this section.)

Let us briefly recall some of the relevant definitions:

Definition 4.1.7. A 2-query, ≠-based Dictator Test for functions $f : \{-1,1\}^n \to \{-1,1\}$ is
a randomized procedure for choosing two strings $x, y \in \{-1,1\}^n$. We think of the test as
querying f(x) and f(y), and then accepting when $f(x) \neq f(y)$, and rejecting otherwise.

Definition 4.1.8. The completeness of a Dictator Test T for n-bit functions is

$\mathrm{Completeness}(T) = \min_{i \in [n]} \{\Pr[T \text{ accepts } \chi_i]\},$

where $\chi_i : \{-1,1\}^n \to \{-1,1\}$ is the ith 'Dictator' function, $\chi_i(x) = x_i$.
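
As an illustration of Definitions 4.1.7 and 4.1.8, the following Python sketch computes the completeness of one particular 2-query, ≠-based test by exact enumeration. The test used here (x uniform, y obtained from x by flipping each coordinate independently with probability `flip`) is a hypothetical example chosen for this sketch, and the function names are assumptions.

```python
from itertools import product

def noise_test_pairs(n, flip):
    """Exact distribution of the example test: x is uniform on {-1,1}^n and
    y is x with each coordinate flipped independently with probability `flip`."""
    for x in product((-1, 1), repeat=n):
        for mask in product((False, True), repeat=n):
            k = sum(mask)
            prob = (0.5 ** n) * (flip ** k) * ((1 - flip) ** (n - k))
            y = tuple(-xi if m else xi for xi, m in zip(x, mask))
            yield prob, x, y

def accept_probability(f, n, flip):
    """Probability that the test accepts f, i.e., that f(x) != f(y)."""
    return sum(p for p, x, y in noise_test_pairs(n, flip) if f(x) != f(y))

def completeness(n, flip):
    """Minimum acceptance probability over the n dictator functions."""
    return min(accept_probability(lambda x, i=i: x[i], n, flip) for i in range(n))

example_completeness = completeness(3, 0.9)
```

A dictator is accepted exactly when its coordinate is flipped, so the completeness of this example test equals the flip probability.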

As for the soundness, we defer the formal explanation to Section 4.7; for now, suffice it
to say we make a definition along the following lines:

Definition 4.1.9. (informal) The soundness of a Dictator Test T for functions $f : \{-1,1\}^n \to
[-1,1]$ is

$\mathrm{Soundness}(T) = \max\{\Pr[T \text{ accepts } f] : f \text{ is 'quasirandom'}\}.$

In addition to the unspecified notion 'quasirandom', the reader will notice that we have
generalized to testing functions whose range is [-1, 1] rather than {-1, 1}. The reason for
doing this is that all the applications we present require this generalized setting. The distinction
is similar to the one between proper and fractional cuts. Again, formal definitions
appear in Section 4.7.

Definition 4.1.10. (informal) We call the pair (c,s) a dictator-vs.-quasirandom gap if for
all $\eta > 0$, for sufficiently large n there is a dictator-vs.-quasirandom test $T^{(n)}$ for functions
$f : \{-1,1\}^n \to [-1,1]$ with $\mathrm{Completeness}(T^{(n)}) \ge c$ and $\mathrm{Soundness}(T^{(n)}) \le s + \eta$. We define the
dictator-vs.-quasirandom gap curve by

$\mathrm{Gap}_{\mathrm{Test}}(c) = \inf\{s : (c,s) \text{ is a dictator-vs.-quasirandom gap}\}.$

As mentioned, our interest in dictator-vs.-quasirandom tests comes from their application
to algorithmic hardness results. We give three such applications here. The first is the
original application, implicitly proved in [99]:

Theorem 4.1.11 ([99]). Suppose (c,s) is a dictator-vs.-quasirandom gap, and $\eta > 0$. Then
the Unique Games Conjecture (UGC) implies that it is NP-hard to distinguish Max Cut instances with value
at least $c - \eta$ from instances with value at most $s + \eta$. I.e., assuming the UGC and P ≠ NP
we essentially have $\mathrm{Apx}_A(c) \le \mathrm{Gap}_{\mathrm{Test}}(c)$ for all efficient algorithms A and all c.

(The 'essentially' here refers to the fact that we really only have $\mathrm{Apx}_A(c - \eta) \le \mathrm{Gap}_{\mathrm{Test}}(c)$
for all $\eta > 0$. Ultimately we will show that $\mathrm{Gap}_{\mathrm{Test}}$ is continuous, so this distinction is
irrelevant.)


4.1.5  Motivation  and  Discussion 

In this section we discuss the motivation and merits of determining the optimal approximability
curve of Max Cut for every value of $c$.

First, Max Cut is a fundamental algorithmic problem; indeed, it is arguably the simplest
NP optimization problem. For the reasons discussed in section 4.1.1, we feel that
understanding its approximability for the entire range of $c$ is important. We are hardly
alone in this regard; for example, in 2001 Feige and Langberg [48] wrote that they were
“trying to extend the techniques of [50] in order to prove [that RPR$^2$ algorithms can match
the SDP gap curve for values of $c < .844$]”. Besides the algorithmic work on the Max Cut
curve we’ve already described [30, 48, 59, 142], there has also been a great deal of recent
work on the closely related problem Max-2Lin [1, 2, 8, 17, 77]. For example, the
Grothendieck/Quadratic Programming results of [1, 2, 30] are nothing more than analysis
of the Max-2Lin approximability curve at $\tfrac12 + \varepsilon$ (with the underlying graph structure
fixed to be bipartite, in the Grothendieck case). Further, analyzing the Max Cut/Max-2Lin
approximability curves at $1 - \varepsilon$ for subconstant $\varepsilon$ is very strongly related to analyzing
Sparsest-Cut approximability.

Further, the fundamental nature of the Max Cut problem makes our inability to understand
its computational complexity all the more galling. Recall that every value of $c$
for which we don’t know the largest efficiently achievable value of $\mathrm{Apx}_A(c)$ yields a basic,
natural problem not known to be in P and not known to be NP-hard: e.g., “Given a
graph with a cut of size 60%, find a cut of size 55%”. Without the UGC, it seems we have
no idea how to prove sharp inapproximability results, although in this work we did the
best we could by ruling out RPR$^2$ algorithms from achieving $\mathrm{Apx}(c) > S(c)$. Assuming the
UGC, though, the present work completely closes the Max Cut problem. Even if one does
not believe the UGC, there are several takeaways: First, we’ve shown that the UGC cannot
be disproved by giving good Max Cut SDP rounding algorithms, for any value of $c$.
Second, our work gives an improved approximation algorithm inspired by
UGC/dictator-vs.-quasirandom test considerations.

Finally,  we  hope  that  the  methods  developed —  specifically,  the  use  of  Hermite  anal¬ 
ysis,  von  Neumann’s  Minimax  Theorem,  Borell’s  rearrangement  inequality  [24],  and  the 
Karush-Kuhn-Tucker  conditions  —  can  be  used  to  make  progress  on  understanding  SDP 
gaps  and  approximability  of  other  fundamental  problems.  Specifically,  we  believe  our 
methods  should  be  useful  for  attacking  Max-2Sat  and  other  2-CSPs  (some  indication  of 
this  is  given  already  in  the  recent  work  of  Austrin  [13,  14]),  3-CSPs,  and  perhaps  even  for 
determining  the  Grothendieck  constant  [63]. 

4.1.6  Statement  of  Main  Results 

Our first result, from which the remaining results derive, is a complete determination of
the SDP gap curve. We introduce an explicit function $S : [\tfrac12,1] \to [\tfrac12,1]$, and show that
$\mathrm{Gap}_{\mathrm{SDP}}(c) = S(c)$ for all $c$. In particular, the proof of the lower bound, $\mathrm{Gap}_{\mathrm{SDP}}(c) \geq S(c)$,
is achieved via a poly($n$)-time RPR$^2$ algorithm. Thus we have an efficient algorithm for
Max Cut which has optimal SDP-approximation curve. The fact that an RPR$^2$ algorithm
achieves the SDP gap confirms a conjecture suggested by Feige and Langberg [48].

Next, we show how to transform the SDP results into dictator-vs.-quasirandom testing
results. Specifically, we are able to show that the dictator-vs.-quasirandom gap curve is
identical to the SDP gap curve; i.e., $\mathrm{GapTest}(c) = S(c)$ for all $c \in [\tfrac12,1]$. This result gives us
optimal dictator-vs.-quasirandom tests. In addition:

•  The SDP gap curve with triangle inequalities, $\mathrm{Gap}_{\triangle\mathrm{SDP}}$, is also identical to the curve
$S$.

•  If $A$ is any RPR$^2$ algorithm then $\mathrm{Apx}_A(c) \leq S(c)$ for all $c$, even assuming both of
the following: (i) $A$ uses the SDP with triangle inequalities; (ii) $A$ is not required to
choose $Z$ to be a random $n$-dimensional Gaussian, but rather is allowed to deterministically
select the best $Z$ satisfying $\|Z\| = O(\sqrt{n})$. (Contrast this with the fact that
in graphs exhibiting the $c$ vs. $S(c)$ SDP gap, our RPR$^2$ algorithm actually finds an
essentially optimal cut.)

•  If $A$ is any polynomial-time algorithm then $\mathrm{Apx}_A(c) \leq S(c)$ for all $c$, assuming $\mathrm{P} \neq \mathrm{NP}$
and the UGC.

4.1.7  The  Critical  Curve,  S 

At  this  point  the  reader  might  wish  to  know  the  identity  of  this  critical  curve  S(c).  Unfor¬ 
tunately,  there  is  no  ‘nice’  formula  for  it.  Rather,  it  is  defined  as  follows: 

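$$S(c) \;=\; \inf_{\substack{(1,\rho_0)\text{-distributions } P \\ \text{with mean } 1-2c}} \;\; \sup_{\substack{r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r). \tag{4.2}$$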

Not all of the expressions above have even been defined yet; in particular, ‘$(1,\rho_0)$-distribution’
(a certain simple kind of probability distribution on $[-1,1]$) and ‘$\mathcal{G}_P$’ (a certain infinite
graph). Further, on the face of it this definition does not look very ‘explicit’, especially
since the inf and sup are both over infinite sets. Nevertheless, in section 4.5 we prove the
following:

Theorem 4.1.12. There is an algorithm that, on input $c \in [\tfrac12,1]$ and $\varepsilon > 0$, runs in time
poly($1/\varepsilon$) and computes $S(c)$ to within $\pm\varepsilon$.

We  believe  this  justifies  our  claim  that  S  is  ‘explicitly  given’.  A  brief  discussion  of  this 
point  appears  in  section  4.6.1. 

In fact, as we will describe in the next section, significant portions of $S(c)$ can be
described or estimated more simply. For $c \geq .844$, $S(c)$ agrees with the Goemans-Williamson
SDP-approximation curve, $\tfrac{1}{\pi}\arccos(1-2c)$. For $c = \tfrac12 + \varepsilon$, $S(c) \sim \tfrac12 + \tfrac12 \cdot \varepsilon/\ln(1/\varepsilon)$ up to
lower-order terms (this is proved in Section 4.14, tightening the asymptotics of [30, 102]). A plot
of $S(c)$ versus $c$ appears in Section 4.15.
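
To make the Goemans-Williamson portion of the curve concrete, the following is a minimal sketch that evaluates $\tfrac{1}{\pi}\arccos(1-2c)$ and locates, by a plain grid search, the value of $c$ at which the ratio of this curve to $c$ is smallest. The function names `gw_curve` and `worst_ratio` are illustrative and not from the text. The minimizer lands close to the .844 threshold quoted above, which is consistent with that threshold being the point where the classical Goemans-Williamson ratio is attained.

```python
import math


def gw_curve(c: float) -> float:
    """Goemans-Williamson SDP-approximation curve: arccos(1 - 2c) / pi."""
    return math.acos(1.0 - 2.0 * c) / math.pi


def worst_ratio(steps: int = 100_000):
    """Grid search over c in (1/2, 1] for the minimum of gw_curve(c) / c."""
    best_c, best_ratio = 1.0, 1.0
    for i in range(1, steps + 1):
        c = 0.5 + 0.5 * i / steps
        ratio = gw_curve(c) / c
        if ratio < best_ratio:
            best_c, best_ratio = c, ratio
    return best_c, best_ratio


c_star, alpha = worst_ratio()
print(f"ratio gw_curve(c)/c is smallest near c = {c_star:.4f}, value about {alpha:.4f}")
```

The grid search is deliberately naive; it only illustrates the shape of the curve and is not part of the algorithm of Theorem 4.1.12.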

4.1.8  Prior  Work 

Surveying  the  entirety  of  the  previous  work  on  approximation  algorithms,  SDP  gaps,  and 
hardness  results  for  Max  Cut  would  take  several  pages,  so  we  restrict  ourselves  to  briefly 
summarizing  the  best  results  known  prior  to  this  work. 

SDP and Dictator Testing gaps. Combining prior work of many authors yields the
following:

1.  For $c \geq .844$: $\mathrm{Gap}_{\mathrm{SDP}}(c) = \mathrm{Gap}_{\triangle\mathrm{SDP}}(c) = \mathrm{GapTest}(c) = \tfrac{1}{\pi}\arccos(1-2c)$.

2.  For $c = \tfrac12 + \varepsilon$: $\mathrm{Gap}_{\mathrm{SDP}}(c)$, $\mathrm{Gap}_{\triangle\mathrm{SDP}}(c)$, and $\mathrm{GapTest}(c)$ all have asymptotics $\tfrac12 + \Theta(\varepsilon/\ln(1/\varepsilon))$.

As can be seen, this already pins down substantial portions of these curves fairly well. In
the next section we will argue the merits of pinning them down precisely.

The lower bound $\mathrm{Gap}_{\mathrm{SDP}}(c) \geq \tfrac{1}{\pi}\arccos(1-2c)$ for $c \geq .844$ is, as mentioned, due to Goemans
and Williamson [59], using RPR$^2$ with the rounding function sgn. The matching
upper bound is due to Feige and Schechtman [50], using infinite graphs with vertex set
$S^{n-1}$ and edge set connecting all vectors with inner product at most $1 - 2c$. The lower
bound $\mathrm{Gap}_{\mathrm{SDP}}(\tfrac12+\varepsilon) \geq \tfrac12 + \Omega(\varepsilon/\ln(1/\varepsilon))$ is due to Charikar and Wirth [30], using RPR$^2$ with
$s$-linear rounding functions, as suggested by Feige and Langberg [48]. The upper bound
$\mathrm{Gap}_{\mathrm{SDP}}(\tfrac12+\varepsilon) \leq \tfrac12 + O(\varepsilon/\ln(1/\varepsilon))$ is due to Khot and O’Donnell [102], using mixtures of
correlated Gaussian graphs (described in section 4.2.2). As mentioned, we tighten the
asymptotics of the previous two results in Section 4.14. Finally, Feige and Langberg showed
some additional numerical lower bounds for $\mathrm{Gap}_{\mathrm{SDP}}(c)$, via RPR$^2$ with $s$-linear rounding
functions; e.g., $\mathrm{Gap}_{\mathrm{SDP}}(.6) \geq .5477$.

The upper bound $\mathrm{GapTest}(c) \leq \tfrac{1}{\pi}\arccos(1-2c)$ actually holds for all $c \in [\tfrac12, 1]$; this was
conjectured by Khot, Kindler, Mossel, and O’Donnell [99] and proved by Mossel, O’Donnell,
and Oleszkiewicz [116]. The ‘noise sensitivity’ test from [99] involves choosing $x \in \{-1,1\}^n$
uniformly at random and choosing $y$ by flipping each coordinate of $x$ with probability
$c$. (As we will discuss in section 4.10, this construction is quite similar to one introduced
by Karloff [86] and analyzed further in [3, 4].) The upper bound
$\mathrm{GapTest}(\tfrac12 + \varepsilon) \leq \tfrac12 + O(\varepsilon/\ln(1/\varepsilon))$ was proved by Khot and O’Donnell [102], by mixing together two tests of
the type in [99]. The remaining parts of the above statements implicitly follow from Khot
and Vishnoi [107]. Interestingly, although proving lower bounds for $\mathrm{GapTest}(c)$ is a very
natural problem from the point of view of Property Testing, it doesn’t seem to have been
explicitly considered in the literature. Indeed, using the Khot-Vishnoi result is a
very circuitous way to prove Dictator Testing lower bounds. We discuss this point further
in section 4.9.
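
The noise sensitivity test described above can be checked on small examples by exact enumeration. The sketch below is illustrative only: the helper names (`accept_probability`, `dictator`, `majority`) are not from the text, and majority is used merely as a familiar function that depends on all of its coordinates; the formal notion of ‘quasirandom’ is deferred to section 4.7. A dictator is accepted with probability exactly $c$, since the test accepts precisely when the queried coordinate is flipped.

```python
import itertools


def accept_probability(f, n: int, c: float) -> float:
    """Exact Pr[f(x) != f(y)], where x is uniform on {-1,1}^n and y is obtained
    by flipping each coordinate of x independently with probability c."""
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        for flips in itertools.product((False, True), repeat=n):
            k = sum(flips)  # number of flipped coordinates
            prob = (0.5 ** n) * (c ** k) * ((1.0 - c) ** (n - k))
            y = tuple(-xi if flip else xi for xi, flip in zip(x, flips))
            if f(x) != f(y):
                total += prob
    return total


def dictator(i: int):
    """The i-th dictator function x -> x_i."""
    return lambda x: x[i]


def majority(x):
    """Majority of an odd number of +/-1 bits."""
    return 1 if sum(x) > 0 else -1


n, c = 5, 0.9
dictator_accept = accept_probability(dictator(0), n, c)
majority_accept = accept_probability(majority, n, c)
print(f"dictator accepted w.p. {dictator_accept:.4f}, majority w.p. {majority_accept:.4f}")
```

The enumeration costs $4^n$ steps, so it is only meant for very small $n$.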

Algorithmic hardness. Early results on algorithmic hardness involved showing upper
bounds on the approximation curve of specific algorithms. In particular, work of
Karloff [86], Alon and Sudakov [3], and Alon, Sudakov, and Zwick [4] showed that for
the GW algorithm, $\mathrm{Apx}_{\mathrm{GW}}(c) \leq \tfrac{1}{\pi}\arccos(1-2c)$, where $\mathrm{Apx}_{\mathrm{GW}}(c)$ denotes the expected
performance, over $Z$, of the GW algorithm. Further, this result holds even if one adds all
‘valid’ constraints to the SDP. As we describe in section 4.10, these results can be seen as
very weak forms of dictator-vs.-quasirandom tests. Feige and Schechtman [50] extended
these results to the case where the algorithm can pick any halfspace cut (although only
under the triangle inequalities, not any valid constraints). Assuming the UGC,
Theorem 4.1.11 (from [99]) implies NP-hardness of achieving approximation curve exceeding $\mathrm{GapTest}(c)$.
The best unconditional NP-hardness result is much weaker: Håstad [76] together with
Trevisan, Sorkin, Sudan, and Williamson [138] showed that achieving $\mathrm{Apx}(\tfrac{17}{21}) > \tfrac{16}{21}$ is NP-hard;
it is easy to translate this into hardness of $\mathrm{Apx}(\tfrac12 + \varepsilon) > \tfrac12 + \tfrac{11}{13}\varepsilon$ for $\varepsilon \leq \tfrac{13}{42}$ and hardness
of $\mathrm{Apx}(1 - \varepsilon) > 1 - \tfrac54\varepsilon$ for $\varepsilon \leq \tfrac{4}{21}$.

4.1.9  Comparison  with  Raghavendra’s  Result 

In a powerful independent work, Raghavendra [125] established an equivalence
(with $o(1)$ slack) between $\mathrm{Gap}_{\mathrm{SDP}}(c)$, the optimal approximation curve
$\mathrm{Alg}^{\mathrm{optimal}}(c)$, and $\mathrm{GapTest}(c)$ for almost every CSP with bounded arity. In addition,
he gave an algorithm for calculating $\mathrm{Gap}_{\mathrm{SDP}}$ with running time $\exp(\exp(O(1/\varepsilon)))$
and an optimal SDP rounding algorithm (assuming the UGC) with running time
$\mathrm{poly}(n) \cdot \exp(\exp(O(1/\varepsilon)))$ (also see [127]). Compared with [125], one main advantage of our work
on Max Cut is a much better running time for SDP rounding
and SDP gap calculation. This allows us to explicitly determine the actual value of the
SDP gap as well as the optimal approximation curve for the Max Cut problem. In addition,
our work gives a concrete construction of the worst SDP gap instance: it is a certain $(1,\rho_0)$
Gaussian mixture graph, which will be defined later.


4.2  Proof  Overview 

In this section we describe the ideas and intuition underlying the determination of $\mathrm{Gap}_{\mathrm{SDP}}$.
By the end of the section we will also have defined all the terms necessary for the
definition (4.2) of the curve $S(c)$.

4.2.1  Embedded  Graphs 

The  first  idea  is  to  slightly  shift  the  way  one  looks  at  SDP  gaps  for  Max  Cut.  Usually 
one  thinks  of  first  finding  a  graph  G,  then  showing  Sdp(G)  is  large  and  Opt(G)  is  small. 
But  suppose  one  determines  that  Sdp(G)  is  large  for  some  graph  G;  then  one  may  as  well 
identify  G  with  its  optimal  SDP  embedding  on  the  sphere. 

Definition 4.2.1. An ($n$-dimensional) embedded graph $G$ is one whose vertex set $V$ is a
subset of $S^{n-1}$. For embedded graphs, we explicitly allow self-loops.² The $\rho$-distribution of
the embedded graph, denoted $P = P(G)$, is the discrete probability distribution on $[-1,1]$
given by the distribution of $u \cdot v$ when $(u,v) \sim E$. We define the spread of $G$ (which we also
call the spread of $P$) to be

$$\mathrm{Spread}(G) = \mathrm{Spread}(P) = \mathop{\mathbf{E}}_{\rho \sim P}\left[\tfrac12 - \tfrac12\rho\right] \in [0,1].$$
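
As a small illustration of this definition, the following sketch computes the $\rho$-distribution and the spread of a finite embedded graph. It assumes equally weighted edges and uses illustrative names (`rho_distribution`, `spread`) that do not appear in the text. The example embeds a triangle in the plane with its three vertices at mutual angles of 120 degrees, so every edge has inner product $-\tfrac12$.

```python
import math


def rho_distribution(vertices, edges):
    """Return the rho-distribution of an embedded graph as {rho: probability}.

    `vertices` maps vertex names to unit vectors (tuples of floats);
    `edges` is a list of (u, v) name pairs, assumed equally weighted."""
    dist = {}
    for u, v in edges:
        inner = sum(a * b for a, b in zip(vertices[u], vertices[v]))
        rho = round(inner, 12)  # merge values that differ only by rounding error
        dist[rho] = dist.get(rho, 0.0) + 1.0 / len(edges)
    return dist


def spread(dist):
    """Spread(P) = E_{rho ~ P}[1/2 - rho/2]."""
    return sum(p * (0.5 - 0.5 * rho) for rho, p in dist.items())


# Triangle embedded on the unit circle with vertices 120 degrees apart.
triangle_vertices = {
    "a": (1.0, 0.0),
    "b": (-0.5, math.sqrt(3) / 2),
    "c": (-0.5, -math.sqrt(3) / 2),
}
triangle_edges = [("a", "b"), ("b", "c"), ("a", "c")]

P = rho_distribution(triangle_vertices, triangle_edges)
print("rho-distribution:", P, "spread:", spread(P))
```

Here the spread is $\tfrac34$, whereas no cut of a triangle separates more than two of its three edges.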

Thinking about embedded graphs leads to some important observations. The first is
that we can symmetrize any SDP gap instance. Specifically, let $G$ be an embedded graph
with $\mathrm{Spread}(G) = c$ and $\mathrm{Opt}(G) \leq s$. Suppose $R$ is any rotation of space; then it is clear
that the rotated embedded graph $RG$ also satisfies $\mathrm{Spread}(RG) = c$ and $\mathrm{Opt}(RG) \leq s$, and is
thus an equally good gap instance. Further, if one takes a mixture $H = \lambda G + (1-\lambda)G'$ of any
two embedded graphs $G$ and $G'$ with $\mathrm{Spread}(G) = \mathrm{Spread}(G') = c$ and $\mathrm{Opt}(G), \mathrm{Opt}(G') \leq s$,
then $\mathrm{Spread}(H)$ is again $c$, and also $\mathrm{Opt}(H) \leq s$ by a simple averaging argument. Hence
we can average an SDP gap instance $G$ over all rotations of space, and preserve the gap.
When we do this we get an ‘infinite embedded graph’ whose vertex set is all of $S^{n-1}$ and
whose edge distribution is ‘symmetric’, in the sense that the density on the pair $(u,v)$
depends only on the inner product $u \cdot v$. In fact, the ‘$\rho$-distribution’ of the symmetrized
graph is precisely the original $\rho$-distribution $P(G)$.

²Although we disallow self-loops in Max Cut inputs, we allow them in embedded graphs. One reason for
this is that there is no guarantee that every optimal SDP embedding $g : V \to S^{n-1}$ is injective.
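
The ‘simple averaging argument’ for mixtures can be spelled out as follows. This is a sketch in which $\mathrm{val}_G(h)$ denotes the value of a fractional cut $h$ on $G$ (notation assumed here), $H = \lambda G + (1-\lambda)G'$ with $0 \leq \lambda \leq 1$, and $h$ is an arbitrary cut of $H$, restricted to each component's vertex set as needed:

```latex
\begin{align*}
\mathrm{val}_H(h) &= \lambda\,\mathrm{val}_G(h) + (1-\lambda)\,\mathrm{val}_{G'}(h) \\
                  &\le \lambda\,\mathrm{Opt}(G) + (1-\lambda)\,\mathrm{Opt}(G') \\
                  &\le \lambda s + (1-\lambda)s \;=\; s .
\end{align*}
```

Taking the supremum over $h$ gives $\mathrm{Opt}(H) \leq s$. The same linearity in the edge distribution gives $\mathrm{Spread}(H) = \lambda c + (1-\lambda)c = c$.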

Definition 4.2.2. Let $P$ denote any discrete probability distribution on $[-1,1]$. We define
the $d$-dimensional symmetric embedded graph $\mathcal{S}_P^{(d)}$ to be the embedded graph with vertex
set $S^{d-1}$ and edge distribution over $S^{d-1} \times S^{d-1}$ given by drawing a random pair of unit
vectors with inner product $\rho$, where $\rho$ itself is drawn from $P$.

Thus we have reduced the search for graphs with large SDP gap to the search for
$\rho$-distributions $P$ such that $\mathrm{Spread}(P) = c$ (i.e., the mean of $P$ is $1-2c$) but $\mathrm{Opt}(\mathcal{S}_P^{(d)})$ is small.
Indeed, Feige and Schechtman’s SDP gap instance [50] is precisely of this form; roughly
speaking, they take $P$ to be the distribution with all of its mass concentrated on $1 - 2c$.

Unfortunately, analyzing $\mathrm{Opt}(\mathcal{S}_P^{(d)})$ is not so easy; we will come back to the problem
later.  For  now  let  us  move  to  the  algorithmic  side  of  things.  We  have  seen  that  we  can 
reduce  the  problem  of  finding  large  SDP  gaps  to  studying  symmetric  embedded  graphs. 
Can  we  similarly  reduce  the  problem  of  finding  large  cuts  in  arbitrary  graphs  to  studying 
symmetric  embedded  graphs?  The  observation  here  is  that,  in  some  sense,  this  is  just  what 
the  RPR2  algorithm  is  doing.  Consider  the  steps  of  the  algorithm  from  Definition  4.1.6. 
RPR2  algorithms  do  not  use  the  fact  that  the  SDP  solution  they  operate  on  is  optimal; 
hence  we  can  mentally  dispense  with  Step  1  (SDP)  and  view  RPR2  algorithms  as  simply 
taking  an  embedded  graph  G  as  input  and  trying  to  find  a  large  cut  in  it.  Next,  recalling 
that the $d$-dimensional Gaussian distribution is spherically symmetric, we see that the
RPR2  algorithm  can,  at  a  rough  level,  be  thought  of  as:  (i)  implicitly  constructing  the 
symmetrized  version  of  G;  and  then,  (ii)  outputting  the  ‘one-dimensional’  fractional  cut  r. 
We  will  make  this  idea  more  precise  in  the  next  section.  For  now,  we  note  that  if  RPR2 
algorithms  are  to  achieve  the  SDP  gap,  it  must  in  some  sense  be  the  case  that  optimal  cuts 
in  symmetric  embedded  graphs  are  ‘one-dimensional’.  The  key  to  our  determination 
of  GapSDP(c)  is  showing  that  this  statement  is  sufficiently  true. 
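
The view of RPR$^2$ as an algorithm that takes an embedded graph and cuts it with a one-dimensional rounding function can be sketched as follows. This is a hedged illustration rather than a transcription of Definition 4.1.6, which is not reproduced in this part of the text: it assumes the usual reading of RPR$^2$ as a random Gaussian projection followed by independent randomized rounding of each fractional value, and the names `rpr2_cut` and `sgn` are illustrative.

```python
import random


def rpr2_cut(vectors, r, rng):
    """One run of an RPR^2-style rounding (assumed form, see the lead-in).

    `vectors` maps vertex names to unit vectors in R^d, `r` maps reals to
    [-1, 1], and `rng` is a random.Random instance. Returns a +/-1 labelling."""
    d = len(next(iter(vectors.values())))
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random Gaussian direction Z
    cut = {}
    for name, v in vectors.items():
        x = r(sum(vi * zi for vi, zi in zip(v, z)))  # fractional value r(v . Z)
        # Round to +1 with probability (1 + x) / 2, independently per vertex.
        cut[name] = 1 if rng.random() < (1.0 + x) / 2.0 else -1
    return cut


def sgn(t: float) -> float:
    """The Goemans-Williamson rounding function."""
    return 1.0 if t >= 0 else -1.0


# Two antipodal vectors: with r = sgn they always land on opposite sides.
antipodal = {"u": (1.0, 0.0), "v": (-1.0, 0.0)}
example_cut = rpr2_cut(antipodal, sgn, random.Random(0))
print(example_cut)
```

Under this rounding, an edge $(u,v)$ is cut with probability $\tfrac12 - \tfrac12 r(u \cdot Z)\,r(v \cdot Z)$ conditioned on $Z$, so only the inner products $u \cdot v$ influence the expected cut value.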

4.2.2  Gaussian  Mixture  Graphs 

By now our analysis is heavily dependent on understanding $\mathrm{Opt}(\mathcal{S}_P^{(d)})$, where $P$ is a
distribution with mean $1 - 2c$. I.e., we want to determine

$$\sup_{h : S^{d-1} \to [-1,1]} \;\; \mathop{\mathbf{E}}_{\rho \sim P} \;\; \mathop{\mathbf{E}}_{\substack{(u,v) \sim S^{d-1} \times S^{d-1} \\ \text{with } \langle u,v \rangle = \rho}} \left[\tfrac12 - \tfrac12 h(u)h(v)\right].$$

This is somewhat complicated by the fact that the distribution on vertices (i.e., the uniform
distribution on the surface of the sphere) is not a product distribution, and depends in
a nontrivial way on the dimension $d$. It is possible to at once avoid this difficulty and hew
much more closely to the RPR$^2$ framework by replacing the uniform distribution on $S^{d-1}$
by the $d$-dimensional Gaussian distribution.

Definition 4.2.3. Let $P$ denote any discrete probability distribution on $[-1,1]$. We define
the $d$-dimensional Gaussian mixture graph $\mathcal{G}_P^{(d)}$ to be the probability measure on $\mathbb{R}^d \times \mathbb{R}^d$
given by drawing a pair of ‘$\rho$-correlated $d$-dimensional Gaussians’, where $\rho$ itself is drawn
from $P$. In the case $d = 1$, we simply write $\mathcal{G}_P$. By $\rho$-correlated $d$-dimensional Gaussians we
mean a pair $(x,y)$, where $x$ is a standard $d$-dimensional Gaussian and $y = \rho x + \sqrt{1-\rho^2}\,Z$,
with $Z$ being another standard $d$-dimensional Gaussian independent of $x$. Note that this distribution
is symmetric in $x$ and $y$.
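
The sampling procedure in this definition is short enough to write out directly. The sketch below uses only the standard library; the function names `correlated_gaussians` and `sample_edge`, and the representation of $P$ as a list of `(rho, probability)` pairs, are illustrative choices and not part of the text.

```python
import math
import random


def correlated_gaussians(rho: float, d: int, rng):
    """Draw a pair (x, y) of rho-correlated d-dimensional Gaussians:
    x is standard Gaussian and y = rho * x + sqrt(1 - rho^2) * z."""
    s = math.sqrt(max(0.0, 1.0 - rho * rho))
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    y = [rho * xi + s * zi for xi, zi in zip(x, z)]
    return x, y


def sample_edge(P, d: int, rng):
    """Draw one 'edge' of the d-dimensional Gaussian mixture graph:
    first rho from P (a list of (rho, probability) pairs), then the pair."""
    rhos, probs = zip(*P)
    rho = rng.choices(rhos, weights=probs)[0]
    return correlated_gaussians(rho, d, rng)


rng = random.Random(0)
x, y = sample_edge([(1.0, 0.25), (-0.6, 0.75)], d=3, rng=rng)
print("x =", x)
print("y =", y)
```

Each coordinate of $y$ is again a standard Gaussian, and $\mathbf{E}[x_i y_i] = \rho$, which is one way to see the symmetry in $x$ and $y$ noted in the definition.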

Gaussian mixture graphs, with $P$ concentrated on 1 and one other point, were introduced in [102]
to show SDP gaps for $c$ near $\tfrac12$.

Regarding the effect of switching from $\mathcal{S}_P^{(d)}$ to $\mathcal{G}_P^{(d)}$, recall that the Gaussian
distribution in a high dimension $d$ is very similar to the uniform distribution on the sphere of
radius $\sqrt{d}$. Using this fact, it is not too hard to show that when $\mathrm{Spread}(P) = c$ we have
$\mathrm{Sdp}(\mathcal{G}_P^{(d)}) \geq c - o_d(1)$, via the embedding $x \mapsto x/\|x\|$. Thus we can equally well search for
SDP gaps based on Gaussian mixture graphs. As for algorithms, the RPR$^2$ framework
now has a very simple interpretation: Given an embedded graph $G$ with $\rho$-distribution $P$,
the RPR$^2$ algorithm implicitly converts it to $\mathcal{G}_P$ and cuts it with the rounding function $r$.
More specifically, the expected value of the cut produced by RPR$^2$ on graph $G$ is:


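$$\mathrm{Alg}_{\mathrm{RPR}^2}(G) \;=\; \mathop{\mathbf{E}}_{Z}\; \mathop{\mathbf{E}}_{(u,v) \sim E} \left[\tfrac12 - \tfrac12\, r(u \cdot Z)\, r(v \cdot Z)\right] \;=\; \mathop{\mathbf{E}}_{\rho \sim P(G)}\; \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ \text{1-dim Gaussians}}} \left[\tfrac12 - \tfrac12\, r(x)\, r(y)\right] \;=\; \mathrm{val}_{\mathcal{G}_{P(G)}}(r). \tag{4.3}$$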
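
The right-hand side of (4.3) depends only on the $\rho$-distribution and the rounding function, so it can be estimated by sampling one-dimensional correlated Gaussians. The following sketch does this by Monte Carlo; the names `estimate_val` and `sgn` and the particular two-point distribution are illustrative assumptions. For $r = \mathrm{sgn}$ it compares the estimate against the standard closed form $\Pr[\mathrm{sgn}(x) \neq \mathrm{sgn}(y)] = \tfrac{1}{\pi}\arccos\rho$ for $\rho$-correlated Gaussians.

```python
import math
import random


def estimate_val(P, r, samples: int, rng) -> float:
    """Monte Carlo estimate of E_{rho~P} E_{(x,y) rho-corr'd}[1/2 - r(x) r(y) / 2],
    where P is a list of (rho, probability) pairs and r maps reals to [-1, 1]."""
    rhos, probs = zip(*P)
    total = 0.0
    for _ in range(samples):
        rho = rng.choices(rhos, weights=probs)[0]
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(max(0.0, 1.0 - rho * rho)) * rng.gauss(0.0, 1.0)
        total += 0.5 - 0.5 * r(x) * r(y)
    return total / samples


def sgn(t: float) -> float:
    return 1.0 if t >= 0 else -1.0


# A two-point distribution with mean 0.25 - 0.45 = -0.2, i.e. spread c = 0.6.
P = [(1.0, 0.25), (-0.6, 0.75)]
estimate = estimate_val(P, sgn, 200_000, random.Random(1))
closed_form = sum(p * math.acos(rho) / math.pi for rho, p in P)
print(f"Monte Carlo: {estimate:.4f}   closed form for sgn: {closed_form:.4f}")
```

Swapping `sgn` for another odd, increasing function into $[-1,1]$ gives a quick way to compare rounding functions on a fixed $P$, which is the optimization an RPR$^2$ algorithm faces.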


The reader can now see that given $G$, an RPR$^2$ algorithm should strive to take $r$ to be
the optimal cut $r : \mathbb{R} \to [-1,1]$ for $\mathcal{G}_P$. This leads us to two questions:

1.  Can we algorithmically determine an $r$ which gives a near-optimal cut for $\mathcal{G}_P$?

2.  Whether or not we can, would this be enough to match the SDP gap? In other words,
is it true that for all $\rho$-distributions $P$ with spread $c \in [\tfrac12, 1]$,

$$\mathrm{Opt}(\mathcal{G}_P) \;\geq\; \inf_{\substack{\rho\text{-distributions } P' \text{ with spread } c, \\ d \geq 1}} \mathrm{Opt}(\mathcal{G}_{P'}^{(d)})? \tag{4.4}$$

Here the left-hand side represents what we hope to achieve algorithmically with RPR$^2$,
and the right-hand side represents the upper bound on $\mathrm{Gap}_{\mathrm{SDP}}(c)$ we can achieve using
Gaussian mixture graphs.

Question 2 above is the heart of the matter; we describe its affirmative answer in the
next section. For now, let us discuss Question 1. Although analytically we don’t know
the optimal cut for $\mathcal{G}_P$, there is a feeling that one could algorithmically find an $r$ coming
within $\varepsilon$ of the optimum by using the Feige-Langberg idea of trying ‘all’ possible $r$, suitably
discretized. Indeed, Feige and Langberg wrote that if one only considers ‘well-behaved’
rounding functions $r$ (suggesting piecewise differentiable functions with bounded derivatives)
then one can construct a collection of $2^{\mathrm{poly}(1/\varepsilon)}$ many discretized rounding functions
such that one of them achieves a cut in $\mathcal{G}_P$ that is within $\varepsilon$ of that achieved by the best
well-behaved rounding function.


Unfortunately, there is no guarantee that the optimal cut for $\mathcal{G}_P$ is ‘well-behaved’.
Even if it were guaranteed to be piecewise differentiable, we have no way of proving that
its derivatives don’t depend on ‘$n$’; i.e., the number of points in $P$’s support. Thus we do not
know of any way of efficiently (in $n$) discretizing the search space for the optimal rounding
function of a given $\mathcal{G}_P$. But luckily, in the next section we will see that for the ‘worst’ $P$,
there is a relatively well-behaved optimal cut $r$; specifically, there is an increasing optimal
cut. The fact that increasing functions are $O(1/\varepsilon)$-Lipschitz except on a set of measure $\varepsilon$
means it will be sufficient to discretize the set of rounding functions $r$ in a way depending
only on $\varepsilon$ and not on $n$. Indeed, our actual algorithm for finding cuts of size at least $S(c) - \varepsilon$
in graphs $G$ with $\mathrm{Sdp}(G) \geq c$ is:

Algorithm 4.2.4. Perform the RPR$^2$ algorithm, trying out all $2^{\tilde{O}(1/\varepsilon^2)}$ possible ‘$\varepsilon$-discretized’
rounding functions $r$.

The definition of ‘$\varepsilon$-discretized’ is given in section 4.4. A discussion of the running time,
$\mathrm{poly}(|V|) \cdot 2^{\tilde{O}(1/\varepsilon^2)}$, appears in section 4.6.2.


4.2.3  Hermite Analysis, Minimax, and Borell’s Gaussian Rearrangement

We now come to the main conceptual part of the determination of $\mathrm{Gap}_{\mathrm{SDP}}$, namely proving
(4.4). Suppose we could show that for every $P'$, there was an optimal cut $f$ for $\mathcal{G}_{P'}^{(d)}$
that was ‘one-dimensional’ (i.e., of the form $f(x) = r(u \cdot x)$, where $r : \mathbb{R} \to [-1,1]$ and $u$ is
any unit vector). It’s easy to see that the value of $f$ in $\mathcal{G}_{P'}^{(d)}$ is just $\mathrm{val}_{\mathcal{G}_{P'}}(r)$; hence we would
show $\mathrm{Opt}(\mathcal{G}_{P'}^{(d)}) = \mathrm{Opt}(\mathcal{G}_{P'})$, proving (4.4). Unfortunately, we do not know whether this is
the case. What we will show, though, is that when $P'$ is the ‘worst’ distribution, $\mathcal{G}_{P'}^{(d)}$ has
an optimal one-dimensional (and increasing, as promised) cut.


To start, we take advantage of our switch to Gaussian graphs; this allows us to express
the value of cuts $f : \mathbb{R}^d \to [-1,1]$ using ‘Hermite analysis’ (akin to Fourier analysis over
$\{-1,1\}^n$). Specifically, given a cut $f$ one has

$$\mathrm{val}_{\mathcal{G}_P^{(d)}}(f) \;=\; \tfrac12 - \tfrac12 \mathop{\mathbf{E}}_{\rho \sim P}\left[\sum_{S \in \mathbb{N}^d} \hat{f}(S)^2 \rho^{|S|}\right], \tag{4.5}$$

where each $\hat{f}(S) \in \mathbb{R}$ is a ‘Hermite coefficient’, and $|S|$ denotes $\sum_{i=1}^{d} S_i$. Using this formula
one can easily show that any optimal cut $f$ may as well be odd; i.e., satisfy $f(-x) = -f(x)$.
Further, when $f$ is odd, the sum in (4.5) can be restricted to only be over $S$’s such that $|S|$
is odd.


We now make the following observation: For fixed odd $f$, the expression
$\mathbb{S}_f(\rho) := \sum_{|S| \text{ odd}} \hat{f}(S)^2 \rho^{|S|}$ is a polynomial in $\rho$ (power series, actually) with nonnegative coefficients
and only odd powers. This means that it is convex for $\rho \geq 0$ and concave for $\rho \leq 0$. Now suppose
we keep $f$ fixed but vary the $\rho$-distribution $P$, subject only to it having mean $1 - 2c$.
Using formula (4.5), one sees that we can make $\mathrm{val}_{\mathcal{G}_P^{(d)}}(f)$ as low as the value of the convex
lower envelope of $\tfrac12 - \tfrac12\mathbb{S}_f(\rho)$ at $1 - 2c$. Further, by the convexity/concavity described,
one achieves this by concentrating all of $P$’s probability mass on at most two points: some
negative number $\rho_0$, and possibly also 1.
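
As a concrete illustration (a standard fact about correlated Gaussians, not taken from this section), consider the one-dimensional cut $f = \mathrm{sgn}$. Writing $\mathbb{S}_f(\rho)$ for the odd-degree sum $\sum_{|S| \text{ odd}} \hat{f}(S)^2 \rho^{|S|}$ from the observation above, one gets a power series of exactly the stated shape:

```latex
\begin{align*}
\mathbb{S}_{\mathrm{sgn}}(\rho)
  &= \mathop{\mathbf{E}}\bigl[\mathrm{sgn}(x)\,\mathrm{sgn}(y)\bigr]
   = \tfrac{2}{\pi}\arcsin\rho
   = \tfrac{2}{\pi}\Bigl(\rho + \tfrac{1}{6}\rho^{3} + \tfrac{3}{40}\rho^{5} + \cdots\Bigr), \\
\tfrac12 - \tfrac12\,\mathbb{S}_{\mathrm{sgn}}(\rho)
  &= \tfrac12 - \tfrac{1}{\pi}\arcsin\rho
   = \tfrac{1}{\pi}\arccos\rho .
\end{align*}
```

All coefficients are nonnegative and only odd powers appear, so the series is convex on $[0,1]$ and concave on $[-1,0]$. Evaluated at $\rho = 1 - 2c$, the second line is the Goemans-Williamson curve $\tfrac{1}{\pi}\arccos(1-2c)$.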

Definition 4.2.5. We call a discrete probability distribution $P$ on $[-1,1]$ a $(1,\rho_0)$-distribution
if $P$ puts positive probability on some $-1 \leq \rho_0 \leq 0$, nonnegative probability on 1, and zero
probability on all other values in $[-1,1]$.
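
Once $\rho_0$ and the target mean $1 - 2c$ are fixed, the two weights of a $(1,\rho_0)$-distribution are determined: if $p$ is the mass on $\rho_0$, then $(1-p) + p\rho_0 = 1 - 2c$ gives $p = 2c/(1-\rho_0)$, which is a valid probability exactly when $\rho_0 \leq 1 - 2c$. The sketch below encodes this; the function name `one_rho0_distribution` and the list-of-pairs representation are illustrative choices, not from the text.

```python
def one_rho0_distribution(c: float, rho0: float):
    """Return the (1, rho0)-distribution with mean 1 - 2c as a list of
    (value, probability) pairs, or raise ValueError if none exists."""
    if not 0.5 <= c <= 1.0:
        raise ValueError("c must lie in [1/2, 1]")
    if not -1.0 <= rho0 <= 1.0 - 2.0 * c:
        raise ValueError("need -1 <= rho0 <= 1 - 2c for the mean to be attainable")
    p0 = 2.0 * c / (1.0 - rho0)  # mass on rho0, solving (1 - p0) + p0 * rho0 = 1 - 2c
    return [(1.0, 1.0 - p0), (rho0, p0)]


P = one_rho0_distribution(c=0.6, rho0=-0.6)
mean = sum(value * prob for value, prob in P)
print("distribution:", P, "mean:", mean)
```

For a given $c$ this leaves $\rho_0 \in [-1, 1-2c]$ as the single free parameter of the infimum in the definition of $S(c)$.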

These considerations suggest that the Gaussian mixture graphs with lowest Max Cut
are those based on $(1,\rho_0)$-distributions. This doesn’t constitute a proof, though, because
we fixed the cut and the graph in the wrong order: we are supposed to fix the distribution
$P$ first and then choose the optimal cut. Ultimately, though, we prove that
$(1,\rho_0)$-distributions are the worst case for Gaussian mixture graphs by using the von Neumann
Minimax Theorem: we can reverse the order of fixing the distribution and the cut if we
allow the ‘cut Player’ to choose a distribution on cuts. Fortunately, a convex combination
of $\mathbb{S}_f(\rho)$ polynomials has the same convexity/concavity properties as a single one, so the
previous argument goes through. Unfortunately, one also has to overcome some rather
severe discretization/compactness complications to use the von Neumann Theorem in this
infinitary setting.

At this point we essentially have that the Gaussian mixture graphs with smallest Max
Cut are those based on $(1,\rho_0)$-distributions. Finally, we are able to deduce that in such
graphs there are optimal, one-dimensional, increasing cuts through the use of Borell’s
rearrangement inequality for Gaussian space [24]. Borell’s theorem implies that for $\rho \in [0,1]$,
the quantity $\mathbb{S}_f(\rho)$ can only increase if one ‘rearranges’ $f$’s values into an increasing,
one-dimensional function. If $G = \mathcal{G}_P^{(d)}$ is a Gaussian mixture graph with $P$ a $(1,\rho_0)$-distribution,
then formula (4.5) tells us that $\mathrm{val}_G(f)$ is (up to an additive $\tfrac12$) a negative linear combination
of $\mathbb{S}_f(1)$ and $\mathbb{S}_f(\rho_0)$. It turns out that $\mathbb{S}_f(1)$ is just $\mathbf{E}[f^2]$, which doesn’t change under
rearrangement, and when $f$ is odd $\mathbb{S}_f(\rho_0) = -\mathbb{S}_f(-\rho_0)$; hence Borell implies that this quantity
decreases under rearrangement. This proves that indeed there is a one-dimensional
and increasing optimal cut.

Thus  we  establish  that  (4.4)  holds  and  that  the  right-hand  side  in  that  inequality  is 
precisely  S(c). 


4.2.4  Organization of the Remaining Proof

The above is just a high-level overview of the proof. The remaining parts are organized as
follows: The construction of optimal dictator-vs.-quasirandom tests from Gaussian mixture
graphs mimics the proof of the Majority Is Stablest theorem using the ‘Invariance Principle’
from [116]; the poly($1/\varepsilon$)-time algorithm for computing $S(c)$ to within $\varepsilon$, promised in
Theorem 4.1.12, involves combining the Karush-Kuhn-Tucker conditions with Borell’s theorem;
and the remaining work involves careful discretization arguments.




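4.3  $\mathrm{Gap}_{\mathrm{SDP}}(c) \leq S(c)$: Hermite Analysis and Borell’s Rearrangement

In this section we prove $\mathrm{Gap}_{\mathrm{SDP}}(c) \leq S(c)$; i.e., we show that for each $c \in [\tfrac12,1]$ and $\eta > 0$,
there exists a graph $G$ exhibiting a large SDP gap: $\mathrm{Sdp}(G) \geq c$ and $\mathrm{Opt}(G) \leq S(c) + \eta$. We
remind the reader here of the definition of $S(c)$:

$$S(c) \;=\; \inf_{\substack{(1,\rho_0)\text{-distributions } P \\ \text{with mean } 1-2c}} \;\; \sup_{\substack{r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r).$$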

4.3.1  SDP  Gaps  via  Gaussian  Mixture  graphs 

As described in sections 4.2.2 and 4.2.3, the graphs we use to exhibit SDP gaps will be
high-dimensional Gaussian mixture graphs based on $(1,\rho_0)$-distributions. Since these are
infinite graphs, we will need to extend a number of our basic definitions, including ‘$\mathrm{Sdp}(G)$’
and ‘$\mathrm{Opt}(G)$’. The reader may object that these will not be proper SDP gap examples because
the graphs are infinite and also have self-loops (one might even object that the graphs are
weighted). However, in Section 4.12 we show that these issues can be circumvented:

Proposition 4.3.1. Suppose $G = \mathcal{G}_P^{(d)}$ is a Gaussian mixture graph with $\mathrm{Sdp}(G) \geq c$ and
$\mathrm{Opt}(G) \leq s$. Then for any $\varepsilon > 0$, there is a finite, self-loopless, unweighted graph $G'$ (with
$n = (1/\varepsilon)^{O(d)}$ vertices) with $\mathrm{Sdp}(G') \geq c - \varepsilon$ and $\mathrm{Opt}(G') \leq s + \varepsilon$.

The proof of this proposition essentially only uses straightforward, already-known
ideas [8, 50, 102]. The reader should also note that arbitrarily small losses in $c$ are also
immaterial, since we can show (essentially a priori) that $\mathrm{Gap}_{\mathrm{SDP}}(c)$ is continuous:

Proposition 4.3.2. The function $\mathrm{Gap}_{\mathrm{SDP}}$ is continuous on $[\tfrac12,1]$, and strictly increasing
from $\tfrac12$ to 1.

The proof of this proposition is in Section 4.11.


Extending the basic Max Cut definitions to infinite graphs is quite straightforward;
see [102]. Here we will just treat the special case of Gaussian mixture graphs, which
require a little extra care due to the fact that they can have ‘self-loops’. To begin, we
define cuts and value as before: A (fractional) cut for $\mathcal{G}_P^{(d)}$ is any measurable function
$f : \mathbb{R}^d \to [-1,1]$, and

$$\mathrm{val}_{\mathcal{G}_P^{(d)}}(f) \;=\; \mathop{\mathbf{E}}_{\rho \sim P}\;\; \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ d\text{-dim. Gaussians}}} \left[\tfrac12 - \tfrac12 f(x)f(y)\right].$$

Since we allow ‘self-loops’ (i.e., $P$’s with probability mass on 1), one should note that we
can’t necessarily find ‘proper’ cuts with value at least that of fractional cuts. We define
$\mathrm{Opt}(\mathcal{G}_P^{(d)})$ to be the supremum of the value over all fractional cuts.

Second, we define $\mathrm{Sdp}(\mathcal{G}_P^{(d)})$ essentially as in the SDP (4.1):

$$\mathrm{Sdp}(\mathcal{G}_P^{(d)}) \;=\; \sup_{g : \mathbb{R}^d \to B_d} \;\; \mathop{\mathbf{E}}_{(u,v) \sim \mathcal{G}_P^{(d)}} \left[\tfrac12 - \tfrac12\, g(u) \cdot g(v)\right].$$

Some comments on this definition: Again, because of self-loops, it is not necessarily true
that the optimal embedding $g$ maps into the surface of the ball, $S^{d-1}$. As it happens,
though, we are only concerned with proving lower bounds on $\mathrm{Sdp}(\mathcal{G}_P^{(d)})$, and the embeddings
we will use happen to map into $S^{d-1}$ anyway. Second, the most natural definition of
$\mathrm{Sdp}(G)$ for an ‘infinite graph’ $G$ would allow embeddings into $B_m$ and have an additional
sup over $m \in \mathbb{N}$. But again, we will end up only considering embeddings $\mathbb{R}^d \to S^{d-1}$,
so we choose to make the above simpler definition.

Having  made  these  definitions,  the  goal  of  this  section  is  to  prove  the  following  two 
theorems: 

Theorem 4.3.3. Let $G = \mathcal{G}_P^{(d)}$ be a d-dimensional Gaussian mixture graph, and let $c =
\mathrm{Spread}(P) = \mathbf{E}_{\rho \sim P}[\tfrac12 - \tfrac12 \rho]$. Then $\mathrm{Sdp}(G) \ge c - O(\sqrt{\log d / d})$, via the embedding
$g : \mathbb{R}^d \to S^{d-1}$ mapping x to $x/\|x\|$.3

Theorem 4.3.4. Let $G = \mathcal{G}_P^{(d)}$ be a d-dimensional Gaussian mixture graph for which P is
a $(1, \rho_0)$-distribution. Then the optimal fractional cut for G is achieved by an increasing,
odd, ‘one-dimensional’ cut; i.e., a function $s : \mathbb{R}^d \to [-1,1]$ of the form $s(x) = r(x_1)$, where
$r : \mathbb{R} \to [-1,1]$ is increasing and odd.

Theorem  4.3.3  is  just  a  calculation;  the  heart  of  the  matter  is  Theorem  4.3.4. 

Before proving these theorems, let us see how together they imply $\mathrm{Gap}_{\mathrm{SDP}}(c) \le S(c)$.
Let P be a $(1,\rho_0)$-distribution achieving the inf in the definition of S(c) to within $\epsilon$. Now
consider $G = \mathcal{G}_P^{(d)}$. By Theorem 4.3.3, $\mathrm{Sdp}(G) \ge c - O(\sqrt{\log d / d})$. On the other hand,
Theorem 4.3.4 implies that

$$\mathrm{Opt}(G) \le \sup_{\substack{s : \mathbb{R}^d \to [-1,1] \\ \text{one-dimensional, increasing, odd}}} \mathrm{val}_G(s).$$

But when s is one-dimensional, $s(x) = r(x_1)$, it’s immediate from the definitions that
$\mathrm{val}_G(s) = \mathrm{val}_{\mathcal{G}_P}(r)$. Thus we have $\mathrm{Opt}(G) \le S(c) + \epsilon$.

Having determined this Gaussian mixture graph G with $\mathrm{Sdp}(G) \ge c - O(\sqrt{\log d / d})$ and
$\mathrm{Opt}(G) \le S(c) + \epsilon$, we are essentially done. Using Proposition 4.3.1 we can convert G to a
finite, self-loopless graph $G'$ with $\mathrm{Sdp}(G') \ge c - O(\sqrt{\log d / d})$ and $\mathrm{Opt}(G') \le S(c) + 2\epsilon$; since
$\epsilon > 0$ is arbitrary this proves that $\mathrm{Gap}_{\mathrm{SDP}}(c - O(\sqrt{\log d / d})) \le S(c)$. Now letting $d \to \infty$ and
using the continuity of $\mathrm{Gap}_{\mathrm{SDP}}$ (Proposition 4.3.2), we conclude that $\mathrm{Gap}_{\mathrm{SDP}}(c) \le S(c)$.

4.3.2  Proof  of  Theorem  4.3.3 

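Theorem 4.3.3 Let $G = \mathcal{G}_P^{(d)}$ be a d-dimensional Gaussian mixture graph, and let $c =
\mathrm{Spread}(P) = \mathbf{E}_{\rho \sim P}[\tfrac12 - \tfrac12 \rho]$. Then $\mathrm{Sdp}(G) \ge c - O(\sqrt{\log d / d})$, via the embedding
$g : \mathbb{R}^d \to S^{d-1}$ mapping x to $x/\|x\|$.

3 g(0) can be set arbitrarily.

Proof. As stated, let $g(x) = x/\|x\|$, which maps onto $S^{d-1}$. (The value of g(0) may be set
arbitrarily since the probability that one of $\mathcal{G}_P^{(d)}$’s ‘edges’ involves 0 is 0.) We need to show:

$$\mathop{\mathbf{E}}_{\rho \sim P}\ \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ d\text{-dim. Gaussians}}} \left[\tfrac12 - \tfrac12\, \frac{x}{\|x\|} \cdot \frac{y}{\|y\|}\right] \ge \mathop{\mathbf{E}}_{\rho \sim P}\left[\tfrac12 - \tfrac12 \rho\right] - O(\sqrt{\log d / d}).$$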


Clearly  it  suffices  to  prove  the  following: 


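$$\text{for all } \rho \in [-1,1], \qquad \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ d\text{-dim. Gaussians}}} \left[\frac{x}{\|x\|} \cdot \frac{y}{\|y\|}\right] \le \rho + O(\sqrt{\log d / d}). \tag{4.6}$$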


This can be considered a standard probability result. Inside the expectation, in the
numerator, we have

$$x \cdot y = \sum_{i=1}^{d} x_i y_i,$$

and the summands $x_i y_i$ are i.i.d. real-valued random variables. The expectation of $x_i y_i$ is
$\rho$, and the variance and third absolute moment are bounded by absolute constants. Thus
the Berry-Esseen theorem implies that $x \cdot y$ will be in the range $\rho d \pm O(\sqrt{d \log d})$ except
with probability at most $O(1/\sqrt{d})$. In the denominator, it is well-known (and a similar
argument shows) that $\|x\|$ and $\|y\|$ will each be in the range $\sqrt{d} \pm O(\sqrt{\log d})$ except with
probability at most $O(1/\sqrt{d})$. Hence except with probability at most $O(1/\sqrt{d})$ we have that

$$\frac{x}{\|x\|} \cdot \frac{y}{\|y\|} \le \frac{\rho d + O(\sqrt{d \log d})}{(\sqrt{d} - O(\sqrt{\log d}))(\sqrt{d} - O(\sqrt{\log d}))} \le \rho + O(\sqrt{\log d / d}).$$

Since $\frac{x}{\|x\|} \cdot \frac{y}{\|y\|}$ is bounded above by 1 always, we gain at most $O(1/\sqrt{d})$ in the exceptional
cases, and conclude that (4.6) indeed holds. □
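As a sanity check on (4.6), the following is a minimal Monte Carlo sketch, not part of the proof. It samples $\rho$-correlated d-dimensional Gaussians via $y = \rho x + \sqrt{1-\rho^2}\, z$ and averages the normalized inner product. The function name, seed, and sample sizes are illustrative assumptions.

```python
import math
import random


def mean_normalized_inner_product(rho, d, trials, seed=0):
    """Estimate E[(x/|x|) . (y/|y|)] for rho-correlated d-dim Gaussians."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # y = rho*x + sqrt(1-rho^2)*z makes each (x_i, y_i) a rho-correlated pair.
        y = [rho * xi + scale * rng.gauss(0.0, 1.0) for xi in x]
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        total += dot / (norm_x * norm_y)
    return total / trials


if __name__ == "__main__":
    rho = -0.5
    for d in (10, 100, 400):
        est = mean_normalized_inner_product(rho, d, trials=300)
        # Second column: the estimate; third: the error scale sqrt(log d / d).
        print(d, round(est, 4), round(math.sqrt(math.log(d) / d), 4))
```

In such runs the estimate should sit close to $\rho$, well inside the $\sqrt{\log d / d}$ error scale; the sketch only illustrates the statement and says nothing about the constants hidden in the $O(\cdot)$.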


4.3.3  Proof  of  Theorem  4.3.4 

Before  proceeding  with  the  proof  of  Theorem  4.3.4  we  record  here  the  basic  facts  from 
‘Hermite  analysis’  we  will  use  throughout  this  work. 

The space of functions $L^2(\mathbb{R}^d)$ under the Gaussian distribution has a countable
orthonormal basis given by products of normalized Hermite polynomials. These products
are indexed by vectors $S \in \mathbb{N}^d$; we use the notation $|S|$ for $\sum_{i=1}^{d} S_i$, which is also the
degree of the product polynomial $H_S$. We can express any such function f via its ‘Hermite
expansion’,

$$f(x) = \sum_{S \in \mathbb{N}^d} \hat{f}(S) H_S(x),$$

with convergence in $L^2$-norm. We make frequent use of the following definition:

Definition 4.3.5. Given $f \in L^2(\mathbb{R}^d)$ and $\rho \in [-1,1]$, the noise stability of f at $\rho$ is

$$\mathbb{S}_\rho(f) = \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ d\text{-dim. Gaussians}}} [f(x) f(y)].$$

(Note that we reversed the notational position of $\rho$ and g in section 4.2.3 for clarity
of exposition.) The following basic facts about Hermite expansions are well known; they are
essentially the Fourier analysis of Chapter 2 with $(\Omega, \mu)$ taken to be the Gaussian
distribution (though the dimensionality is infinite now). See, e.g., [102] and the references
therein.

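Proposition 4.3.6.

1. $\mathbb{S}_\rho(f) = \sum_{S \in \mathbb{N}^d} \rho^{|S|} \hat{f}(S)^2$.

2. $\mathbb{S}_1(f) = \sum_{S \in \mathbb{N}^d} \hat{f}(S)^2 = \mathbf{E}[f^2]$.

3. If f is an odd function (i.e., $f(-x) = -f(x)$), then $\hat{f}(S) = 0$ unless $|S|$ is odd.

4. If f is an odd function then $\mathbb{S}_{-\rho}(f) = -\mathbb{S}_\rho(f)$.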

We  also  immediately  deduce  the  following  fact: 

Proposition 4.3.7. Assume f is an odd function. Then as a function of $\rho$, $\mathbb{S}_\rho(f)$ is a power
series with nonnegative coefficients, odd powers of $\rho$ only, and radius of convergence at least
1. In particular it is an odd function of $\rho$, strictly increasing on $[-1,1]$, 0 at 0, concave on
$[-1,0]$, and convex on $[0,1]$.
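To make the shape described in Proposition 4.3.7 concrete, here is a small illustrative sketch for the one-dimensional odd function $r = \mathrm{sign}$. For this particular choice (an example assumed here, not singled out in the text) the noise stability has the closed form $\mathbb{S}_\rho(\mathrm{sign}) = \tfrac{2}{\pi} \arcsin \rho$ (Sheppard's formula), which the code compares against a Monte Carlo estimate. All function names are hypothetical.

```python
import math
import random


def sign(t):
    return 1.0 if t > 0 else (-1.0 if t < 0 else 0.0)


def noise_stability_mc(r, rho, samples=100_000, seed=1):
    """Monte Carlo estimate of S_rho(r) = E[r(x) r(y)], (x, y) rho-correlated 1-dim Gaussians."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for _ in range(samples):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + scale * rng.gauss(0.0, 1.0)
        total += r(x) * r(y)
    return total / samples


def sheppard(rho):
    """Closed form for S_rho(sign): (2/pi) * arcsin(rho)."""
    return 2.0 / math.pi * math.asin(rho)


if __name__ == "__main__":
    for rho in (-0.8, 0.0, 0.6):
        print(rho, round(noise_stability_mc(sign, rho), 3), round(sheppard(rho), 3))
```

The closed form is visibly odd in $\rho$, vanishes at 0, and is concave on $[-1,0]$ and convex on $[0,1]$, matching the proposition.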

We  now  proceed  with  the  proof: 

Theorem 4.3.4 Let $G = \mathcal{G}_P^{(d)}$ be a d-dimensional Gaussian mixture graph for which P is
a $(1, \rho_0)$-distribution. Then the optimal fractional cut for G is achieved by an increasing,
odd, ‘one-dimensional’ cut; i.e., a function $s : \mathbb{R}^d \to [-1,1]$ of the form $s(x) = r(x_1)$, where
$r : \mathbb{R} \to [-1,1]$ is increasing and odd.


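Proof. Suppose P has weight p on the point $-1 \le \rho_0 \le 0$ and weight $1-p$ on the point 1.
Let $(f_i)$ be a sequence of measurable fractional cuts, $f_i : \mathbb{R}^d \to [-1,1]$, for which
$\mathrm{val}_G(f_i) \nearrow \mathrm{Opt}(G)$. We have

$$\mathrm{val}_G(f_i) = \mathop{\mathbf{E}}_{\rho \sim P}\ \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ d\text{-dim. Gaussians}}} \left[\tfrac12 - \tfrac12 f_i(x) f_i(y)\right] = \tfrac12 - \tfrac12 \mathop{\mathbf{E}}_{\rho \sim P}[\mathbb{S}_\rho(f_i)],$$

and hence

$$1 - 2\,\mathrm{val}_G(f_i) = (1-p)\,\mathbb{S}_1(f_i) + p\,\mathbb{S}_{\rho_0}(f_i). \tag{4.7}$$

Consider now replacing $f_i$ by $f_i^{\mathrm{odd}}$, the function $\mathbb{R}^d \to [-1,1]$ given by
$f_i^{\mathrm{odd}}(x) = (f_i(x) - f_i(-x))/2$. It is well known that $\widehat{f_i^{\mathrm{odd}}}(S)$ equals $\hat{f_i}(S)$ for odd $|S|$ and is 0 for even $|S|$.
Thus when we make this replacement, $\mathbb{S}_1(f_i) = \sum_S \hat{f_i}(S)^2$ only decreases, and similarly
$\mathbb{S}_{\rho_0}(f_i) = \sum_S \hat{f_i}(S)^2 \rho_0^{|S|}$ only decreases (using the fact that $\rho_0 \le 0$). Thus (4.7) only
decreases, and hence $\mathrm{val}_G(f_i)$ can only increase. Thus we may assume each $f_i$ is odd.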

Given this assumption and using Proposition 4.3.6.4,

$$1 - 2\,\mathrm{val}_G(f_i) = (4.7) = (1-p)\,\mathbf{E}[f_i^2] - p\,\mathbb{S}_{-\rho_0}(f_i). \tag{4.8}$$

We now appeal to the Gaussian rearrangement inequality of Borell [24], which implies
that for any function $f_i \in L^2(\mathbb{R}^d)$ and any nonnegative $\rho$,

$$\mathbb{S}_\rho(f_i) \le \mathbb{S}_\rho(f_i^*);$$

here $f_i^*$ is the Gaussian rearrangement of $f_i$, an increasing, one-dimensional function.4
Suppose then we replace each $f_i$ by $f_i^*$. Since it holds that $\mathbf{E}[(f_i^*)^2] = \mathbf{E}[f_i^2]$, the first term
in (4.8) does not change. But $-\rho_0$ is nonnegative, so we can use Borell’s result to conclude
that the second term $\mathbb{S}_{-\rho_0}(f_i)$ only increases. Hence (4.8) only decreases under Gaussian
rearrangement and thus $\mathrm{val}_G(f_i)$ only increases. Thus we may replace all of the $f_i$’s by
their Gaussian rearrangements. Note that an odd function, when rearranged, is still odd.

We now have a sequence of one-dimensional, odd, increasing functions $r_i : \mathbb{R} \to [-1,1]$,
with $\mathrm{val}_G(r_i) \nearrow \mathrm{Opt}(G)$ (we abuse notation here slightly instead of writing $\mathrm{val}_G(s_i)$ where
$s_i : \mathbb{R}^d \to [-1,1]$ is defined by $s_i(x) = r_i(x_1)$). It is well known that using a Helly-type
proof we can pass to a subsequence that converges a.e. to an increasing, one-dimensional
function r, which must also be odd. Dominated convergence then implies that $\mathrm{val}_G(r) =
\mathrm{Opt}(G)$. □

4.4 $\mathrm{Gap}_{\mathrm{SDP}}(c) \ge S(c)$: Discretized RPR$^2$ and Minimax

In this section we show that $\mathrm{Gap}_{\mathrm{SDP}}(c) \ge S(c)$. As described in section 4.2.2, the idea will
be to randomly find cuts in a given embedded graph by trying the RPR$^2$ algorithm with
‘all’ increasing, odd rounding functions. Of course, we actually only try ‘all’ such functions
up to some discretization. Specifically:

Definition 4.4.1. Given $\epsilon > 0$, let $\mathcal{I}_\epsilon$ denote the partition of $\mathbb{R} \setminus \{0\}$ into intervals,

$$\mathcal{I}_\epsilon = \left\{\pm(-\infty, -B],\ \pm(-B, -B+\epsilon^2],\ \pm(-B+\epsilon^2, -B+2\epsilon^2],\ \ldots,\ \pm(-2\epsilon^2, -\epsilon^2],\ \pm(-\epsilon^2, 0)\right\},$$

where $B = B(\epsilon)$ is the smallest integer multiple of $\epsilon^2$ exceeding $\sqrt{2 \ln(1/\epsilon)}$. We say that a
function $r : \mathbb{R} \to [-1,1]$ is $\epsilon$-discretized if the following hold:

• r is identically $-1$ on $(-\infty, -B]$, 0 at 0, and identically 1 on $[B, \infty)$.

• r’s values on the finite intervals in $\mathcal{I}_\epsilon$ are from the set $\epsilon\mathbb{Z} \cap (-1, 1)$.

Note that the number of different $\epsilon$-discretized r’s is $2^{\tilde{O}(1/\epsilon^2)}$.
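The following sketch computes the cutoff $B(\epsilon)$ and the breakpoints of the finite positive intervals from Definition 4.4.1, together with a crude count of candidate functions. The count ignores the monotonicity constraint and treats an odd function as determined by its values on the positive intervals; it is an assumption-laden upper bound for illustration, not the exact number, and the helper names are hypothetical.

```python
import math


def cutoff(eps):
    """Return (k, B): B = k * eps^2 is the smallest multiple of eps^2 exceeding sqrt(2 ln(1/eps))."""
    step = eps * eps
    target = math.sqrt(2.0 * math.log(1.0 / eps))
    k = math.floor(target / step) + 1
    return k, k * step


def positive_breakpoints(eps):
    """Breakpoints 0, eps^2, 2*eps^2, ..., B of the finite positive intervals."""
    k, _ = cutoff(eps)
    step = eps * eps
    return [j * step for j in range(k + 1)]


def log2_count_upper_bound(eps):
    """log2 of (number of allowed values)^(number of positive finite intervals)."""
    k, _ = cutoff(eps)
    allowed_values = 2 * math.ceil(1.0 / eps) - 1  # multiples of eps strictly inside (-1, 1)
    return k * math.log2(allowed_values)


if __name__ == "__main__":
    for eps in (0.5, 0.2, 0.1):
        k, B = cutoff(eps)
        print(eps, k, round(B, 4), round(log2_count_upper_bound(eps), 1))
```

The exponent grows like $(1/\epsilon^2)$ up to logarithmic factors, which is the behaviour the $\tilde{O}$ in the count refers to.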

The  main  theorem  we  prove  in  this  section  is  the  following: 

Theorem 4.4.2. There is a universal constant5 $K < \infty$ such that for all $c \in [\tfrac12, 1]$,

$$\inf_{\substack{\text{discrete dists } P \text{ on } [-1,1] \\ \text{with mean } 1-2c}}\ \max_{\substack{\epsilon\text{-discretized } r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r) \tag{4.9}$$

is within $\pm K\epsilon$ of

$$S(c) = \inf_{\substack{(1,\rho_0)\text{-distributions } P \\ \text{with mean } 1-2c}}\ \sup_{\substack{r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r).$$

4 Borell only proves this for $f_i$ Lipschitz and nonnegative, but both conditions are inessential; the first
can be removed by standard approximation arguments and the second simply by adding a sufficiently large
constant. Alternatively, one can use the alternate proof of Borell’s theorem due to Beckner [18].

5 In future results in this section, different K’s may have different values; however they never depend on
c or $\epsilon$.


Aside from discretization issues, the main idea here is using Hermite analysis and the
von Neumann Minimax Theorem to show that the ‘worst’ $\rho$-distribution is a $(1, \rho_0)$-distribution.
Incidentally, the discretization issues are not just necessary because we want a finitary
algorithm; in fact, discretization is also necessary for the use of the Minimax Theorem
(which also requires a finitary setting, or at least some kind of continuity and
compactness).

Let  us  explain  how  we  can  use  Theorem  4.4.2  algorithmically: 

Theorem 4.4.3. Let G be any (discrete) embedded graph with spread c. If we run
Algorithm 4.2.4 on G, trying RPR$^2$ on G with all possible increasing, odd $\epsilon$-discretized rounding
functions r, then at least one will achieve, in expectation, a cut of value at least $S(c) - O(\epsilon)$.
In particular, there exists a cut in G with value at least S(c).

Proof. Given any r, the observation (4.3) from Section 4.2.2 implies that $\mathrm{Alg}_{\mathrm{RPR}^2}(G) =
\mathrm{val}_{\mathcal{G}_P}(r)$. Thus the suggested algorithm achieves at least (4.9), which by Theorem 4.4.2 is
at least $S(c) - K\epsilon$. As for the last statement in the theorem, we’ve in particular shown that
there exists some cut $f_\epsilon : V \to [-1,1]$ with value at least $S(c) - K\epsilon$. Taking $\epsilon \to 0$ we can get
a sequence of cuts $f_i$ with $\limsup \mathrm{val}_G(f_i) \ge S(c)$. But since each cut is just a point in the
compact, finite-dimensional cube $[-1,1]^{|V|}$ and since $\mathrm{val}_G(\cdot)$ is continuous, we can extract
a limiting cut f with value at least S(c). □
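Algorithm 4.2.4 is not reproduced in this section, so the following is only a hedged sketch of the generic RPR$^2$ pattern the theorem refers to: project each vertex's unit vector onto a random Gaussian direction, apply a rounding function r to the projection, and keep the best candidate r on average. The candidate list, the trial count, and all function names are illustrative assumptions.

```python
import random


def sign(t):
    return 1.0 if t > 0 else (-1.0 if t < 0 else 0.0)


def clip(t):
    return max(-1.0, min(1.0, t))


def rpr2_round(vectors, r, rng):
    """One RPR^2 round: fractional cut f(v) = r(x_v . Z) for a random Gaussian vector Z."""
    dim = len(next(iter(vectors.values())))
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return {v: r(sum(a * b for a, b in zip(x, z))) for v, x in vectors.items()}


def cut_value(edges, f):
    """Average over edges of 1/2 - 1/2 f(u) f(v)."""
    return sum(0.5 - 0.5 * f[u] * f[v] for u, v in edges) / len(edges)


def best_rounding(vectors, edges, candidates, trials=200, seed=0):
    """Try each candidate rounding function; return (best average value, its index)."""
    rng = random.Random(seed)
    best_avg, best_idx = -1.0, None
    for idx, r in enumerate(candidates):
        avg = sum(cut_value(edges, rpr2_round(vectors, r, rng))
                  for _ in range(trials)) / trials
        if avg > best_avg:
            best_avg, best_idx = avg, idx
    return best_avg, best_idx


if __name__ == "__main__":
    # Two antipodal vertices joined by one edge: any odd r separates them.
    vectors = {0: (1.0, 0.0), 1: (-1.0, 0.0)}
    print(best_rounding(vectors, [(0, 1)], [sign, clip]))
```

In the theorem the candidates would be all increasing, odd, $\epsilon$-discretized functions; here two hand-picked odd functions stand in for that family.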

Corollary 4.4.4. For each $c \in [\tfrac12, 1]$ it holds that $\mathrm{Gap}_{\mathrm{SDP}}(c) \ge S(c)$. Indeed, there is an
algorithm which, given any graph G with $\mathrm{Sdp}(G) \ge c$ and any $\epsilon > 0$, runs in time
$\mathrm{poly}(|V|) \cdot 2^{\tilde{O}(1/\epsilon^2)}$ and with high probability outputs a proper cut in G with value at least $S(c) - \epsilon$.

Proof. Given G, we can solve the semidefinite program and find an isomorphic embedded
graph $G'$ with spread at least c. It is quite easy to decrease the spread of an embedded
graph arbitrarily; for example, map each $x \in S^{n-1}$ to $(tx, \sqrt{1-t^2}) \in S^n$ for a $t \in [0,1]$ of
one’s choosing. Thus we may assume that $G'$ has spread exactly c. Now the algorithm
from Theorem 4.4.3 (which has the dominating running time stated) is used to obtain
a cut with value at least $S(c) - O(\epsilon)$. As $\epsilon > 0$ can be arbitrarily small, this establishes
$\mathrm{Gap}_{\mathrm{SDP}}(c) \ge S(c)$.
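The spread-decreasing map from the proof can be sketched as follows. Under the map $x \mapsto (tx, \sqrt{1-t^2})$ every inner product $x \cdot y$ becomes $t^2\, x \cdot y + (1 - t^2)$, so the spread is multiplied by $t^2$; choosing $t = \sqrt{c/c'}$ brings an embedding of spread $c' \ge c$ down to spread exactly c. The helper names below are hypothetical, and spread is taken to be the edge average of $\tfrac12 - \tfrac12\, x_u \cdot x_v$.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def spread(vectors, edges):
    """Average over edges of 1/2 - 1/2 x_u . x_v."""
    return sum(0.5 - 0.5 * dot(vectors[u], vectors[v]) for u, v in edges) / len(edges)


def shrink(vectors, t):
    """Map each unit vector x to (t*x, sqrt(1 - t^2)), again a unit vector."""
    extra = math.sqrt(1.0 - t * t)
    return {v: tuple(t * a for a in x) + (extra,) for v, x in vectors.items()}


def shrink_to_spread(vectors, edges, target):
    """Assumes 0 < target <= current spread; the spread scales by t^2."""
    t = math.sqrt(target / spread(vectors, edges))
    return shrink(vectors, t)


if __name__ == "__main__":
    h = math.sqrt(3.0) / 2.0
    triangle = {0: (1.0, 0.0), 1: (-0.5, h), 2: (-0.5, -h)}
    edges = [(0, 1), (1, 2), (0, 2)]
    print(spread(triangle, edges), spread(shrink_to_spread(triangle, edges, 0.6), edges))
```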

Some minor algorithmic details are discussed more carefully in Section 4.13. One we
need to mention explicitly is that our algorithm cannot solve the SDP exactly. Instead, we
can use it to find an isomorphic graph with spread exactly $c - \epsilon^2$. Then the algorithm will
find a cut with value at least $S(c - \epsilon^2) - O(\epsilon)$. Since we now know $S = \mathrm{Gap}_{\mathrm{SDP}}$, we can inspect
the proof of Proposition 4.3.2 and conclude that $S(c - \epsilon^2) \ge S(c) - O(\epsilon^2)$ if c is bounded away
from 1, and we can use the fact that $\mathrm{Gap}_{\mathrm{SDP}}(1 - \delta) = \arccos(-1 + 2\delta)/\pi = 1 - \Theta(\sqrt{\delta})$ (from
Goemans-Williamson) to conclude that $S(c - \epsilon^2) \ge S(c) - O(\epsilon)$ if c is close to 1. □

We discuss the issue of the running time’s dependence on $\epsilon$ in section 4.6.2.

Combining Corollary 4.4.4 with the results of section 4.3 completes the proof that

$$\mathrm{Gap}_{\mathrm{SDP}}(c) = S(c).$$

The remainder of this section is devoted to proving Theorem 4.4.2. The proof will proceed
by transforming (4.9) into S(c) in several steps. Each step will modify the range of either
the inf or sup, while changing the overall value by at most $K\epsilon$.


4.4.1  Discretizing  Distributions 


The  first  step  involves  showing  we  can  discretize  the  distributions  P  appearing  in  (4.9). 
This  will  facilitate  our  application  of  the  Minimax  Theorem. 

Definition 4.4.5. Let $c \in [\tfrac12, 1]$ be given and fixed. We say that a discrete distribution P on
$[-1,1]$ is $\epsilon^7$-discretized if its support is contained in $\epsilon^7\mathbb{Z} \cup \{-1, 1\}$.

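Lemma 4.4.6. There is a universal constant $K < \infty$ such that for each $c \in [\tfrac12, 1]$,

$$(4.9) = \inf_{\substack{\text{discrete dists } P \text{ on } [-1,1] \\ \text{with mean } 1-2c}}\ \max_{\substack{\epsilon\text{-discretized } r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r)$$

is within $\pm K\epsilon$ of

$$\inf_{\substack{\epsilon^7\text{-discretized dists } P \\ \text{with mean } 1-2c}}\ \max_{\substack{\epsilon\text{-discretized } r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r). \tag{4.10}$$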


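Proof. In fact, (4.10) is clearly at least (4.9), since the inf is over a smaller set. To show the
difference is at most $O(\epsilon)$ it suffices to show that every discrete distribution P on $[-1,1]$
with mean $1 - 2c$ can be converted into an $\epsilon^7$-discretized distribution $P'$ with mean $1 - 2c$
such that

$$\left|\mathrm{val}_{\mathcal{G}_P}(r) - \mathrm{val}_{\mathcal{G}_{P'}}(r)\right| \le O(\epsilon)$$

$$\Leftrightarrow \quad \left|\ \mathop{\mathbf{E}}_{\rho \sim P}\ \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ \text{Gaussians}}} [r(x) r(y)] \ -\ \mathop{\mathbf{E}}_{\rho \sim P'}\ \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ \text{Gaussians}}} [r(x) r(y)]\ \right| \le O(\epsilon) \tag{4.11}$$

holds for every $\epsilon$-discretized, increasing, odd r.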


The conversion of P to $P'$ proceeds as follows. For each atom $\rho_i$ of P, choose $\rho_i' \le \rho_i''$
to be the two values in $\epsilon^7\mathbb{Z} \cup \{-1, 1\}$ which straddle $\rho_i$ as closely as possible. Write also
$\rho_i = \lambda_i \rho_i' + (1 - \lambda_i) \rho_i''$, $\lambda_i \in [0,1]$. We form $P'$ by replacing each atom $\rho_i$ with probability
mass $p_i$ in P with the pair of atoms $\rho_i'$, $\rho_i''$ with masses $p_i \lambda_i$, $p_i (1 - \lambda_i)$, respectively. We
have that $P'$ is indeed an $\epsilon^7$-discretized distribution with the same mean as P, namely
$1 - 2c$.
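The conversion just described can be sketched in exact arithmetic as follows. The grid spacing is passed in as a generic `step` (standing in for $\epsilon^7$, which would be impractically fine in a demo), and the function names are hypothetical.

```python
from fractions import Fraction
import math


def discretize_distribution(atoms, step):
    """Split each atom rho between its two nearest grid neighbours in
    step*Z union {-1, 1}, with weights chosen so that the mean is preserved."""
    grid_dist = {}
    for rho, mass in atoms.items():
        rho, mass = Fraction(rho), Fraction(mass)
        lo = max(Fraction(-1), step * math.floor(rho / step))
        hi = min(Fraction(1), step * math.ceil(rho / step))
        if lo == hi:
            shares = [(lo, mass)]
        else:
            lam = (hi - rho) / (hi - lo)  # rho = lam*lo + (1 - lam)*hi
            shares = [(lo, mass * lam), (hi, mass * (1 - lam))]
        for point, share in shares:
            if share:
                grid_dist[point] = grid_dist.get(point, Fraction(0)) + share
    return grid_dist


def mean(dist):
    return sum(p * w for p, w in dist.items())


if __name__ == "__main__":
    P = {Fraction(1, 3): Fraction(1, 2), Fraction(-7, 10): Fraction(1, 2)}
    P_prime = discretize_distribution(P, Fraction(1, 4))
    print(mean(P), mean(P_prime), P_prime)
```

Because each atom is replaced by a pair with the same conditional mean, the overall mean is unchanged exactly, which the `Fraction` arithmetic makes easy to confirm.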


Note that $|\rho_i' - \rho_i|, |\rho_i'' - \rho_i| \le \epsilon^7$ always. It’s easy now to see that (4.11) will follow if we
can show

$$\left|\ \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho_i'\text{-corr'd} \\ \text{Gaussians}}} [r(x) r(y)] \ -\ \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho_i\text{-corr'd} \\ \text{Gaussians}}} [r(x) r(y)]\ \right| \le O(\epsilon) \tag{4.12}$$

holds for all $\epsilon$-discretized increasing odd r, using only $|\rho_i' - \rho_i| \le \epsilon^7$. Now the left side
of (4.12) is equal to $|\mathbb{S}_{\rho_i'}(r) - \mathbb{S}_{\rho_i}(r)|$, and r here is odd. Thus by the increasing/concavity/convexity
properties of $\mathbb{S}_\rho(r)$ given in Proposition 4.3.7, we immediately see that the largest possible
value of $|\mathbb{S}_{\rho_i'}(r) - \mathbb{S}_{\rho_i}(r)|$ would occur when $\rho_i = 1$ and $\rho_i' = 1 - \epsilon^7$ (or equivalently, $\rho_i = -1$,
$\rho_i' = -1 + \epsilon^7$). Thus the proof of (4.12) and hence the lemma follows from Claim 4.4.7 below. □

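Claim 4.4.7. For every fixed $\epsilon$-discretized, increasing, odd r,

$$\left|\ \mathop{\mathbf{E}}_{\substack{(x,y)\ 1\text{-corr'd} \\ \text{Gaussians}}} [r(x) r(y)] \ -\ \mathop{\mathbf{E}}_{\substack{(x,y)\ (1-\epsilon^7)\text{-corr'd} \\ \text{Gaussians}}} [r(x) r(y)]\ \right| \le O(\epsilon).$$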


Proof. Write $\eta = \epsilon^7$. Since 1-correlated Gaussians are identical, we are comparing

$$\mathop{\mathbf{E}}_{\substack{(x,y)\ (1-\eta)\text{-corr'd} \\ \text{Gaussians}}} [r(x) r(y)]$$

with $\mathbf{E}[r(x)^2]$. Using the fact that r is $\epsilon$-discretized, it suffices to show that when $(x,y)$ is a
pair of $(1-\eta)$-correlated Gaussians, the probability that x and y land in different intervals
from $\mathcal{I}_\epsilon$ (recall Definition 4.4.1) is at most $O(\epsilon)$. We will first give up on the half-infinite
intervals in $\mathcal{I}_\epsilon$; using the fact that x and y are both individually distributed as Gaussians,
the probability that either of them ends up at least $B \ge \sqrt{2 \ln(1/\epsilon)}$ in absolute value is at
most $O(\epsilon)$ anyway. Also, the probability that either lands on 0 is 0. It remains to consider
the intervals of the form $I = [t, t + \epsilon^2)$, where $0 \le t < B$ (the case of negative intervals will
be the same). The probability density function for x is nearly constant over the interval
I; in particular, the ratio between its values at t and $t + \epsilon^2$ is $\exp(\epsilon^2 t + \epsilon^4/2)$, which is close
to 1 (since $t < B = O(\sqrt{\log(1/\epsilon)})$). Even just using that it is at most 2, we conclude that
conditioned on x falling into I, the probability that x falls into $[t + 2\epsilon^3, t + \epsilon^2 - \epsilon^3]$ is at least
$1 - O(3\epsilon^3/\epsilon^2) = 1 - O(\epsilon)$.

By losing $O(\epsilon)$ probability, we will assume this happens. In this case, y is distributed
as $(1-\eta)x + \sqrt{1 - (1-\eta)^2}\, N(0,1)$, where $N(0,1)$ is a standard normal. Note that $(1-\eta)x =
x - \eta x \ge x - \eta B \ge x - \epsilon^3$, since $\eta x \le \epsilon^7 B \ll \epsilon^3$. Hence we have $(1-\eta)x \in [t + \epsilon^3, t + \epsilon^2 - \epsilon^3]$. Given
this, the conditional probability that y won’t also fall into I is at most the probability
that $\sqrt{1 - (1-\eta)^2}\, N(0,1)$ will exceed $\epsilon^3$ in absolute value. But the standard deviation of
this normal is $O(\sqrt{\eta}) = O(\epsilon^{3.5})$, so the probability it will exceed $\epsilon^3$ in absolute value is
exponentially small in $1/\epsilon$, certainly smaller than $O(\epsilon)$. Thus we’ve shown that except with
probability at most $O(\epsilon)$, x and y will fall into the same interval from $\mathcal{I}_\epsilon$, and this completes
the proof of the claim. □


4.4.2  Minimax 

The next step in the proof of Theorem 4.4.2 is to reinterpret the space of $\epsilon^7$-discretized
distributions P with mean $1 - 2c$:

Fact 4.4.8. Any $\epsilon^7$-discretized distribution P with mean $1 - 2c$ can be expressed as a convex
combination of 2-point $\epsilon^7$-discretized distributions each with mean $1 - 2c$ (and vice versa,
clearly).

Here, by a ‘2-point distribution’ we mean one whose support is on at most two points
(i.e., either one or two points).

Proof. This fact can be considered standard. One proof sketch is the following: Given any
$\epsilon^7$-discretized P with mean $1 - 2c$, pick any two points which straddle $1 - 2c$ and on which
P has positive probability mass (the two points may coincide in case P has mass on $1 - 2c$).
Such a pair must exist because P has mean $1 - 2c$. Take the mean-$(1 - 2c)$ probability
distribution over this pair and ‘remove it from P’ (i.e., subtract and rescale) to the greatest
extent possible. This will preserve the mean of P being $1 - 2c$, and it will also cause P to
have support on (at least) one fewer point. Repeat this process until P is empty; the pairs
extracted give the required combination of 2-point distributions. □
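The peeling argument in the proof sketch can be written out as a short routine. This version is a sketch under two assumptions: it tracks the unnormalized residual measure instead of rescaling at each step (the mixture weights come out the same), and it always pairs the lowest remaining point below the mean with the highest remaining point above it, which is one valid choice among the "any two points" the proof allows. Names are hypothetical.

```python
from fractions import Fraction


def two_point_decomposition(dist):
    """Write a finite distribution (total mass 1) as a convex combination of
    distributions supported on at most two points, each with the same mean."""
    residual = {Fraction(p): Fraction(w) for p, w in dist.items() if w > 0}
    mean = sum(p * w for p, w in residual.items())
    parts = []
    while residual:
        if mean in residual:
            # Mass sitting exactly on the mean is a 1-point component.
            parts.append((residual.pop(mean), {mean: Fraction(1)}))
            continue
        lo = min(p for p in residual if p < mean)
        hi = max(p for p in residual if p > mean)
        a = (hi - mean) / (hi - lo)  # weight on lo giving the pair mean `mean`
        w = min(residual[lo] / a, residual[hi] / (1 - a))  # remove as much as possible
        parts.append((w, {lo: a, hi: 1 - a}))
        for point, share in ((lo, a), (hi, 1 - a)):
            residual[point] -= w * share
            if residual[point] == 0:
                del residual[point]
    return mean, parts


if __name__ == "__main__":
    P = {Fraction(-1): Fraction(1, 8), Fraction(-1, 2): Fraction(3, 8),
         Fraction(1, 2): Fraction(1, 4), Fraction(1): Fraction(1, 4)}
    print(two_point_decomposition(P))
```

Each iteration removes at least one support point, so the loop runs at most as many times as P has atoms.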

The next step is to reverse the inf/min and max in (4.10) using the von Neumann
Minimax theorem.

Lemma 4.4.9.

$$(4.10) = \min_{\substack{\epsilon^7\text{-discretized dists } P \\ \text{with mean } 1-2c}}\ \max_{\substack{\epsilon\text{-discretized } r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r) \tag{4.13}$$

$$= \max_{\substack{\text{probability distributions } R \text{ over} \\ \epsilon\text{-discretized, increasing,} \\ \text{odd } r : \mathbb{R} \to [-1,1]}}\ \min_{\substack{\text{2-point } \epsilon^7\text{-discretized dists } P \\ \text{with mean } 1-2c}}\ \mathop{\mathbf{E}}_{r \sim R}\left[\mathrm{val}_{\mathcal{G}_P}(r)\right]. \tag{4.14}$$

Proof. Note that (4.10), which has an inf, is not precisely the same as (4.13), which has a
min. We will show that (4.10) equals (4.14) using the Minimax theorem. Since a corollary
of the Minimax theorem is that the inf’s and sup’s involved are achieved, this will imply
that (4.10) is equal to (4.13) and that we can write min and max everywhere.

Consider a zero-sum game between a ‘Distribution Player’ and a ‘Function Player’.
Acting simultaneously, the Distribution Player chooses a 2-point $\epsilon^7$-discretized probability
distribution P with mean $1 - 2c$, and the Function Player chooses an increasing, odd,
$\epsilon$-discretized $r : \mathbb{R} \to [-1,1]$. The payoff is $\mathrm{val}_{\mathcal{G}_P}(r)$ to the Function Player from the
Distribution Player.

Note that both players choose from a finite set of strategies; for the Distribution Player,
this uses the fact that for any pair of discretized points, there is at most one distribution
with mean $1 - 2c$ supported on this pair. Therefore we may apply the von Neumann
Minimax theorem. We conclude that the game has some value, which is achieved in both of the
following scenarios: (a) the Function Player goes first and gets to choose a mixed
strategy, and then the Distribution Player goes second and gets to choose a pure strategy; and,
(b) the Distribution Player goes first and gets to choose a mixed strategy, and the Function
Player goes second and gets to choose a pure strategy. The value in (a) is clearly (4.14). As
for the value in (b), we claim it equals (4.13). This follows from Fact 4.4.8, along with the
fact that if we identify a P with a convex combination of 2-point distributions Q, then for
any r

$$\mathop{\mathbf{E}}_{Q \sim P}\left[\mathrm{val}_{\mathcal{G}_Q}(r)\right] = \mathop{\mathbf{E}}_{Q \sim P}\ \mathop{\mathbf{E}}_{\rho \sim Q}\ \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ \text{Gaussians}}} \left[\tfrac12 - \tfrac12 r(x) r(y)\right] = \mathop{\mathbf{E}}_{\rho \sim P}\ \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ \text{Gaussians}}} \left[\tfrac12 - \tfrac12 r(x) r(y)\right] = \mathrm{val}_{\mathcal{G}_P}(r).$$

Hence (4.13) equals (4.14) and the proof is complete. □


4.4.3  More  Minimax;  Convexity  and  Concavity 

In the next step, we use the special properties of $\mathbb{S}_\rho(r)$ for odd r given in Proposition 4.3.7,
along with further Minimax-based reasoning, to deduce that the ‘Distribution Player’
essentially may as well use a $(1, \rho_0)$-distribution. This idea was discussed in section 4.2.3.

Definition 4.4.10. We say an $\epsilon^7$-discretized distribution P is almost-$(1, \rho_0)$ if it is the
mixture of two $(1, \rho_0)$-distributions for which the two $\rho_0$ values are neighboring (or equal)
discretized values.

Lemma 4.4.11.

$$(4.13) = \min_{\substack{\epsilon^7\text{-discretized dists } P \\ \text{with mean } 1-2c}}\ \max_{\substack{\epsilon\text{-discretized } r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r)$$

$$= \min_{\substack{\epsilon^7\text{-discretized almost-}(1,\rho_0)\text{-dists } P \\ \text{with mean } 1-2c}}\ \max_{\substack{\epsilon\text{-discretized } r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r). \tag{4.15}$$


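Proof. Let $P^*$ denote an $\epsilon^7$-discretized distribution with mean $1 - 2c$ achieving the min
in (4.13); i.e., an optimal mixed strategy for the Distribution Player. Let $R^*$ denote a
distribution over $\epsilon$-discretized, increasing, odd r achieving the max in (4.14); i.e., an optimal
mixed strategy for the Function Player. The Minimax Theorem further implies that $P^*$ is
an optimal strategy for the Distribution Player given that the Function Player uses $R^*$.
I.e., $P^*$ is a minimizing choice for P in the following:

$$\min_{\substack{\epsilon^7\text{-discretized dists } P \\ \text{with mean } 1-2c}}\ \mathop{\mathbf{E}}_{r \sim R^*}\left[\mathrm{val}_{\mathcal{G}_P}(r)\right].$$

Now

$$\mathop{\mathbf{E}}_{r \sim R^*}\left[\mathrm{val}_{\mathcal{G}_P}(r)\right] = \mathop{\mathbf{E}}_{r \sim R^*}\ \mathop{\mathbf{E}}_{\rho \sim P}\ \mathop{\mathbf{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ \text{Gaussians}}} \left[\tfrac12 - \tfrac12 r(x) r(y)\right] = \tfrac12 - \tfrac12 \mathop{\mathbf{E}}_{\rho \sim P}\ \mathop{\mathbf{E}}_{r \sim R^*}\left[\mathbb{S}_\rho(r)\right],$$

and so it follows that $P^*$ is a maximizing choice for P in the following:

$$\max_{\substack{\epsilon^7\text{-discretized dists } P \\ \text{with mean } 1-2c}}\ \mathop{\mathbf{E}}_{\rho \sim P}\ \mathop{\mathbf{E}}_{r \sim R^*}\left[\mathbb{S}_\rho(r)\right].$$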

Suppose we fix a particular odd r. We now have the special properties of $\mathbb{S}_\rho(r)$ as a
function of $\rho$ given in Proposition 4.3.7. We also claim that the convexity and concavity
of this function are essentially strict; i.e., $\mathbb{S}_\rho(r)$ is not linear on any open interval. For
otherwise, by analyticity, $\frac{d^2}{d\rho^2} \mathbb{S}_\rho(r)$ would have to be 0 everywhere on $[-1,1]$, implying that
r is equal (in the $L^2$ sense) to a linear function. But an $\epsilon$-discretized function cannot be
linear, since it is constantly $-1$ on $(-\infty, -B]$ and constantly 1 on $[B, \infty)$.

Next, note that all of the properties mentioned in Proposition 4.3.7 are maintained
under finite convex combinations, in particular because first and second derivatives are
linear. Hence if we define

$$q(\rho) = \mathop{\mathbf{E}}_{r \sim R^*}\left[\mathbb{S}_\rho(r)\right],$$

we conclude that $q(\rho)$ is also an odd function of $\rho$, strictly increasing on $[-1,1]$, 0 at 0,
concave on $[-1,0]$, convex on $[0,1]$, and not linear on any open interval. An illustration of
what q may look like is given in Figure 4.1.

Figure 4.1: Illustrative $q(\rho)$, with least concave upper bound $\bar{q}(\rho)$.

Recall now that $P^*$ is a maximizing choice for P in

$$\max_{\substack{\epsilon^7\text{-discretized dists } P \\ \text{with mean } 1-2c}}\ \mathop{\mathbf{E}}_{\rho \sim P}[q(\rho)].$$

To complete the proof, we will show that this forces $P^*$ to be almost-$(1, \rho_0)$. Suppose we
first disregard the constraint of being $\epsilon^7$-discretized. Then it is easy to see that the
maximum value in the above is equal to $\bar{q}(1 - 2c)$, where $\bar{q}$ denotes the least concave upper
bound of the function q. We have that $\bar{q}$ equals q on some interval $[-1, \rho_0]$, where $\rho_0 \le 0$,
and is a straight line joining $q(\rho_0)$ and $q(1)$ on $[\rho_0, 1]$. Further, in this case there would be a
unique maximizing $P^*$: either the 1-point distribution concentrated on $1 - 2c$, if $1 - 2c \le \rho_0$,
or the $(1, \rho_0)$-distribution with mean $1 - 2c$, if $1 - 2c \ge \rho_0$.

Now we reintroduce the constraint that $P^*$ must be $\epsilon^7$-discretized. Let $\tilde{q}$ denote the
piecewise linear function which interpolates q’s values on the discretized points $\epsilon^7\mathbb{Z}$. We
now have that the maximum value of $\mathbf{E}_{\rho \sim P}[q(\rho)]$ is equal to $\bar{\tilde{q}}(1 - 2c)$, where again $\bar{\tilde{q}}$ is
the least concave upper bound of $\tilde{q}$. The function $\tilde{q}$ is still odd, strictly increasing, concave
on $[-1,0]$, and convex on $[0,1]$; hence again the function $\bar{\tilde{q}}$ equals $\tilde{q}$ on some interval
$[-1, \rho_0]$, where $\rho_0 \le 0$, and is a straight line joining $\tilde{q}(\rho_0)$ and $\tilde{q}(1)$ on $[\rho_0, 1]$. The only
difference now is that the point $\rho_0$ is not necessarily unique; there may be two consecutive
possibilities, if the ‘secant’ at one of the possible $\rho_0$’s is parallel to one of the line segments
touching $\tilde{q}(\rho_0)$. (Note that there cannot be more than two possible $\rho_0$’s, since otherwise
the graph of q would have three distinct collinear points on $[-1,0]$ and would thus be
linear on some open interval.) We conclude that any maximizing $P^*$ must have all of its
support among 1 and the (at most) two discretized values that straddle $\rho_0$; i.e., $P^*$ must
be almost-$(1, \rho_0)$. □
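The least concave upper bound used in this proof can be computed on a grid with a standard upper-hull scan. The sketch below is illustrative only: it assumes the single rounding function $r = \mathrm{sign}$, for which $q(\rho) = \tfrac{2}{\pi} \arcsin \rho$ (Sheppard's formula), and a grid of spacing 0.01 in place of $\epsilon^7\mathbb{Z}$. Function names are hypothetical.

```python
import math


def q(rho):
    """Noise stability of r = sign, an illustrative choice of q."""
    return 2.0 / math.pi * math.asin(rho)


def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def concave_envelope_vertices(points):
    """Vertices of the least concave upper bound of the piecewise linear
    interpolant through `points` (which must be sorted by x)."""
    hull = []
    for p in points:
        # Drop the last vertex while it lies on or below the chord to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def envelope_value(hull, x):
    """Evaluate the envelope at x by linear interpolation between vertices."""
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside the envelope's domain")


if __name__ == "__main__":
    pts = [((k - 100) / 100, q((k - 100) / 100)) for k in range(201)]
    hull = concave_envelope_vertices(pts)
    print("rho_0 on this grid:", hull[-2][0])
    print("envelope at mean 0:", round(envelope_value(hull, 0.0), 4))
```

On this grid the envelope follows q up to a point $\rho_0 \approx -0.69$ and is then the straight segment to $(1, q(1))$, so a maximizing distribution with mean above $\rho_0$ is supported on $\rho_0$ and 1, as in the proof.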

Finally, we can convert almost-$(1, \rho_0)$-distributions to $(1, \rho_0)$-distributions:

Lemma 4.4.12. There is a universal constant $K < \infty$ such that for each $c \in [\tfrac12, 1]$,

$$(4.15) = \min_{\substack{\epsilon^7\text{-discretized almost-}(1,\rho_0)\text{-dists } P \\ \text{with mean } 1-2c}}\ \max_{\substack{\epsilon\text{-discretized } r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r)$$

is within $\pm K\epsilon$ of

$$\min_{\substack{\epsilon^7\text{-discretized } (1,\rho_0)\text{-dists } P \\ \text{with mean } 1-2c}}\ \max_{\substack{\epsilon\text{-discretized } r : \mathbb{R} \to [-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r). \tag{4.16}$$

Proof. We sketch the proof, which uses the same ideas used in the proof of Lemma 4.4.6.
We need to show that any almost-$(1, \rho_0)$-distribution P with mean $1 - 2c$ can be converted
into a $(1, \rho_0)$-distribution $P'$ with mean $1 - 2c$ in such a way that val(r) changes by at
most $O(\epsilon)$ for every $\epsilon$-discretized, increasing, odd r. If P is already a $(1, \rho_0)$-distribution
then we are done. Otherwise, it has support on two neighboring discretized values, say
$\rho_0' < \rho_0''$. Since the mean of P is $1 - 2c$ we must have $\rho_0' < 1 - 2c$. We now form $P'$ by
pushing the weight $\lambda$ that P gave to $\rho_0''$ onto $\rho_0'$. This changes the mean by $\lambda(\rho_0'' - \rho_0') \le \epsilon^7$,
but we can compensate for this by shifting a small amount of weight (at most $2\epsilon^7$) onto the
support point 1. One bounds the change in val(r) caused by these shifts by $O(\epsilon) + O(\epsilon^7)$ via
$|\rho_0'' - \rho_0'| \le \epsilon^7$ and Claim 4.4.7. □

4.4.4  Undiscretizing 

We  have  now  reached  (4.16),  which  is  very  close  to  S(c);  the  only  difference  is  that  we  have 
discretized  distributions  and  functions.  We  now  ‘undiscretize’: 

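Lemma 4.4.13. There is a universal constant $K < \infty$ such that for each $c \in [\tfrac12, 1]$,

\[
(4.16) \;=\; \min_{\substack{\epsilon^7\text{-discretized } (1,\rho_0)\text{-dists } P \\ \text{with mean } 1-2c}} \;\; \max_{\substack{\epsilon\text{-discretized } r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r)
\]

is within $\pm K\epsilon$ of

\[
\inf_{\substack{(1,\rho_0)\text{-distributions } P \\ \text{with mean } 1-2c}} \;\; \sup_{\substack{r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r) \;=\; S(c). \tag{4.17}
\]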

Proof. It is straightforward to see that the ideas from Lemma 4.4.6 can be used to replace the min in (4.16) with the inf from (4.17), changing the value of (4.16) by at most $O(\epsilon)$. Thus we concentrate on discretizing the functions. To that end, fix any $(1,\rho_0)$-distribution $P$ (in fact, our argument will hold for any distribution on $[-1,1]$). We will show that for any increasing, odd $r : \mathbb{R} \to [-1,1]$, there is an $\epsilon$-discretized, increasing, odd $r' : \mathbb{R} \to [-1,1]$ with $|\mathrm{val}_{\mathcal{G}_P}(r) - \mathrm{val}_{\mathcal{G}_P}(r')| \le O(\epsilon)$. This will complete the proof.

So let $r$ be given. Define the increasing, odd, $\epsilon$-discretized function $r' : \mathbb{R} \to [-1,1]$ as follows: On each finite interval $I$ in $\mathcal{I}_\epsilon$, we will take $r'$ to be identically equal to the value of $r$ on the midpoint of $I$ (see footnote 6), rounded to the nearest integer multiple of $\epsilon$ (or $\pm 1$, if one of these is closer). As necessary, we will also take $r'$ to be identically $-1$ on $(-\infty, -B]$ and identically $1$ on $[B, \infty)$. We now argue that $\mathrm{val}_{\mathcal{G}_P}(r')$ is within $\pm O(\epsilon)$ of $\mathrm{val}_{\mathcal{G}_P}(r)$.

The idea is that $|r - r'| \le \epsilon$ except on a set of small Gaussian measure. We will give up on the two half-infinite intervals and include them in the exceptional set. As for the finite intervals in $\mathcal{I}_\epsilon$, since $r$ is increasing and bounded in $[-1,1]$, for at most $1/\epsilon$ of these intervals can $r$ increase by more than $\epsilon$. On the intervals where it increases by less than $\epsilon$, we indeed have $|r - r'| \le \epsilon$. Hence the bound $|r - r'| \le \epsilon$ fails on at most $1/\epsilon$ intervals of width $\epsilon^2$, plus perhaps the two half-infinite intervals $\pm(-\infty, -B]$. Note that the total Gaussian measure of these intervals is at most $O(\epsilon)$. It is thus easy to see that

\[
\mathrm{val}_{\mathcal{G}_P}(r) \;=\; \mathop{\mathbb{E}}_{\rho \sim P} \;\; \mathop{\mathbb{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ \text{Gaussians}}} \left[ \tfrac12 - \tfrac12\, r(x) r(y) \right]
\]

is within $\pm O(\epsilon)$ of $\mathrm{val}_{\mathcal{G}_P}(r')$: The probability that either $x$ or $y$ falls into the 'bad' intervals is at most $2 \cdot O(\epsilon)$, since $x$ and $y$ are each individually distributed as standard Gaussians. In this case, the difference in values is at most 1. Otherwise, we have that $|r(x) - r'(x)|, |r(y) - r'(y)| \le \epsilon$, and then the difference in values is at most $O(\epsilon)$. □


Combining Lemmas 4.4.6, 4.4.9, 4.4.11, 4.4.12, and 4.4.13, we have proved Theorem 4.4.2.


We  end  with  the  following  observation: 

Corollary 4.4.14. Each sup in the definition of $S(c)$, as well as the inf, is achieved. Hence

\[
S(c) \;=\; \min_{\substack{(1,\rho_0)\text{-distributions } P \\ \text{with mean } 1-2c}} \;\; \max_{\substack{r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r).
\]

Proof. (Sketch.) The fact that the sup is achieved for each $P$ is proved in Theorem 4.3.4. The fact that the inf is achieved can be deduced by taking a converging subsequence of $\rho_0$'s, and using the discretization Lemmas 4.4.6 and 4.4.13 to show that the max's for close values of $\rho_0$ are close. □

6. Since we are working in $L^2(\mathbb{R})$, technically here we mean the value of any increasing representative of $r$'s equivalence class.


4.5 Estimating S(c) Efficiently

This  section  is  devoted  to  the  proof  of  Theorem  4.1.12: 

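Theorem 4.1.12. There is an algorithm that, on input $c \in [\tfrac12, 1]$ and $\epsilon > 0$, runs in time $\mathrm{poly}(1/\epsilon)$ and computes $S(c)$ to within $\pm\epsilon$.

As Lemma 4.4.13 shows, $S(c)$ is within $\pm O(\epsilon)$ of

\[
(4.16) \;=\; \min_{\substack{\epsilon^7\text{-discretized } (1,\rho_0)\text{-dists } P \\ \text{with mean } 1-2c}} \;\; \max_{\substack{\epsilon\text{-discretized } r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r).
\]

Since we can enumerate all $\mathrm{poly}(1/\epsilon)$ many $\epsilon^7$-discretized $(1,\rho_0)$-distributions, it is clearly sufficient to show we can efficiently estimate

\[
\max_{\substack{\epsilon\text{-discretized } r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r) \tag{4.18}
\]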

for any $(1,\rho_0)$-distribution $P$. In fact, for technical reasons, we will show how to estimate a slightly different quantity. Specifically, instead of using the rounding function discretization described in Definition 4.4.1, we will use a different one:

Definition 4.5.1. Let $\epsilon > 0$ be such that $1/\epsilon^2$ is an odd integer. We define $\mathcal{J}_\epsilon$ to be the partition of $\mathbb{R}$ into $1/\epsilon^2$ intervals of equal Gaussian measure $\epsilon^2$ (see footnote 7). We say that a function $r : \mathbb{R} \to [-1,1]$ is $\epsilon^2$-equidiscretized if $r$ is constant on each of the intervals in $\mathcal{J}_\epsilon$.
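The partition in Definition 4.5.1 can be made concrete with the Gaussian quantile function. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the function name `equal_measure_partition` is our own and does not appear in the text.

```python
from statistics import NormalDist


def equal_measure_partition(num_intervals):
    """Endpoints of a partition of R into num_intervals cells of equal Gaussian measure.

    num_intervals plays the role of 1/eps^2 and is required to be odd,
    so that the middle cell is symmetric about 0.
    """
    if num_intervals % 2 == 0:
        raise ValueError("num_intervals (= 1/eps^2) should be an odd integer")
    std_normal = NormalDist()
    # The k-th interior endpoint is the k/num_intervals quantile of N(0, 1).
    inner = [std_normal.inv_cdf(k / num_intervals) for k in range(1, num_intervals)]
    return [float("-inf")] + inner + [float("inf")]


# Example with eps = 1/3, i.e. 1/eps^2 = 9 cells of Gaussian measure 1/9 each.
endpoints = equal_measure_partition(9)
```

Because the number of cells is odd, the middle cell straddles the origin, which is what lets an odd equidiscretized function take the value 0 there.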

We will show how to estimate

\[
\sup_{\substack{\epsilon^2\text{-equidiscretized } r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r) \tag{4.19}
\]

to within $\pm O(\epsilon)$ in time $\mathrm{poly}(1/\epsilon)$, whenever $P$ is a $(1,\rho_0)$-distribution. Although this quantity is not directly comparable to (4.18), nevertheless with only minor modifications to the proof of Lemma 4.4.13 one can show that $S(c)$ is also within $\pm O(\epsilon)$ of

\[
\min_{\substack{\epsilon^7\text{-discretized } (1,\rho_0)\text{-dists } P \\ \text{with mean } 1-2c}} \;\; \sup_{\substack{\epsilon^2\text{-equidiscretized } r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r).
\]

(To see this, first note that the function discretization step hardly changes. Second, the proof of Lemma 4.4.6 goes through with $\epsilon^2$-equidiscretized functions as well because the intervals in $\mathcal{J}_\epsilon$ are only wider than the intervals in $\mathcal{I}_\epsilon$.) Thus efficient estimation of (4.19) for $(1,\rho_0)$-distributions is sufficient to establish Theorem 4.1.12.

7. Which partition points are included in which intervals is immaterial.

The reason for our redefinition of discretization is the following: it allows us to drop the conditions 'increasing, odd' from the optimization problem (4.19). Specifically:

Proposition 4.5.2. Let $P$ be a $(1,\rho_0)$-distribution and consider the following optimization problem:

\[
\sup_{\epsilon^2\text{-equidiscretized } r:\mathbb{R}\to[-1,1]} \mathrm{val}_{\mathcal{G}_P}(r). \tag{4.20}
\]

There exists an optimal solution $r^*$ achieving the sup which is both increasing and odd.

Proof. The proof is essentially identical to that of Theorem 4.3.4; the key point is that performing Gaussian rearrangement on an $\epsilon^2$-equidiscretized function yields another $\epsilon^2$-equidiscretized function. □

We now consider (4.20). Suppose $P$ has weight $1-p$ on the point 1 and weight $p$ on the point $\rho_0$; of course, $p = 2c/(1-\rho_0)$. Let us index the intervals in $\mathcal{J}_\epsilon$ from left to right as $I_{-m}, \ldots, I_m$, where $m = (1/\epsilon^2 - 1)/2$. We identify an $\epsilon^2$-equidiscretized function $r$ with the length-$(2m+1)$ vector giving its value on each interval; we will write $r_j$ for the entry corresponding to $I_j$, $-m \le j \le m$. Finally, we write $W_\rho$ for the $(2m+1) \times (2m+1)$ matrix whose $(j,k)$ entry equals the probability that a $\rho$-correlated pair of Gaussians $(x,y)$ will satisfy $x \in I_j$, $y \in I_k$. Now

\[
\mathrm{val}_{\mathcal{G}_P}(r) \;=\; \frac12 - \frac12 \left( (1-p) \sum_{-m \le j,k \le m} W_1(j,k)\, r_j r_k \;+\; p \sum_{-m \le j,k \le m} W_{\rho_0}(j,k)\, r_j r_k \right)
\]

and hence the optimization problem (4.20) is equivalent to the problem

\[
\begin{aligned}
\text{minimize} \quad & r^{\mathsf T} \big( (1-p) W_1 + p W_{\rho_0} \big)\, r, \\
\text{subject to} \quad & -1 \le r_j \le 1 \quad \text{for all } -m \le j \le m.
\end{aligned}
\]
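To make the matrix $W_\rho$ and the quadratic form above concrete, here is a rough numerical sketch. It is not the procedure used for the results reported later in the text; it assumes simple Simpson quadrature, clips the two infinite cells at $\pm 8$, and uses helper names (`cell_endpoints`, `w_matrix`, `val`) that are purely illustrative. It relies on the fact that, conditioned on $x$, the coordinate $y$ is distributed as $N(\rho x, 1-\rho^2)$, and that $W_1$ is $1/(2m+1)$ times the identity.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def cell_endpoints(n_cells, tail=8.0):
    """Endpoints of n_cells cells of equal Gaussian measure (infinite ends clipped to +-tail)."""
    inner = [STD.inv_cdf(k / n_cells) for k in range(1, n_cells)]
    return [-tail] + inner + [tail]


def w_matrix(rho, n_cells, steps=200):
    """Approximate W_rho(j, k) = Pr[x in I_j, y in I_k] for |rho| < 1 (steps must be even)."""
    ends = cell_endpoints(n_cells)
    scale = sqrt(1.0 - rho * rho)
    W = [[0.0] * n_cells for _ in range(n_cells)]
    for j in range(n_cells):
        a, b = ends[j], ends[j + 1]
        h = (b - a) / steps
        for i in range(steps + 1):
            x = a + i * h
            simpson = 1 if i in (0, steps) else (4 if i % 2 else 2)
            density = STD.pdf(x) * simpson * h / 3.0
            # Conditioned on x, y ~ N(rho * x, 1 - rho^2).
            cdfs = [STD.cdf((e - rho * x) / scale) for e in ends]
            for k in range(n_cells):
                W[j][k] += density * (cdfs[k + 1] - cdfs[k])
    return W


def val(r, p, W_rho0):
    """val = 1/2 - 1/2 * r^T ((1-p) W_1 + p W_rho0) r, using W_1 = (1/n) * identity."""
    n = len(r)
    quad = (1 - p) * sum(v * v for v in r) / n
    quad += p * sum(W_rho0[j][k] * r[j] * r[k] for j in range(n) for k in range(n))
    return 0.5 - 0.5 * quad


W = w_matrix(-0.5, 5)                      # hypothetical rho0 = -0.5, five cells
example = val([-1.0, -1.0, 0.0, 1.0, 1.0], 1.0, W)
```

Each row of the computed matrix sums to the Gaussian measure of one cell, which gives a quick sanity check on the quadrature.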

We now consider the Karush-Kuhn-Tucker conditions for this quadratic program and conclude that any optimal solution $r$ must satisfy

\[
\sum_{-m \le k \le m} \big( (1-p) W_1(j,k) + p W_{\rho_0}(j,k) \big)\, r_k = 0, \quad \text{for all } j \text{ such that } -1 < r_j < 1. \tag{4.21}
\]

These necessary conditions for the optimality of a rounding function were already determined by Feige and Langberg [48].

The key observation that lets us make efficient use of the conditions is that we know from Proposition 4.5.2 that there is an optimal increasing odd $r^*$. In particular, there is some $0 \le m_0 \le m$ such that

\[
\begin{aligned}
r^*_j &= -1, && \text{for all } j < -m_0, \\
r^*_j &= 1, && \text{for all } j > m_0, \\
-1 < r^*_j &< 1, && \text{for all } -m_0 \le j \le m_0.
\end{aligned} \tag{4.22}
\]

Thus algorithmically, we can try all possible values for $m_0$, incurring only an $O(1/\epsilon^2)$ factor slowdown. For each choice, we assume an $r^*$ satisfying the conditions (4.22), and we solve (4.21) for the remaining unknown values; i.e., we solve the square system

\[
\sum_{-m_0 \le k \le m_0} \big( (1-p) W_1(j,k) + p W_{\rho_0}(j,k) \big)\, r_k = b_j \quad \text{for all } -m_0 \le j \le m_0, \tag{4.23}
\]

where $b_j = \sum_{k < -m_0} \big( (1-p) W_1(j,k) + p W_{\rho_0}(j,k) \big) - \sum_{k > m_0} \big( (1-p) W_1(j,k) + p W_{\rho_0}(j,k) \big)$. We are guaranteed that there exists an optimal, feasible solution $r^*$ satisfying (4.23) for at least one value of $m_0$.
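The enumeration over $m_0$ can be sketched as follows. This is an illustrative sketch only: it takes the combined matrix $(1-p)W_1 + pW_{\rho_0}$ as a plain list of lists, uses a small hand-written Gaussian elimination, and simply skips any $m_0$ whose system is numerically singular (the threshold `1e-12` is our own choice; the text treats singularity separately). The names `solve_linear` and `best_kkt_candidate` are hypothetical.

```python
def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting; returns None if numerically singular."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda t: abs(M[t][col]))
        if abs(M[piv][col]) < 1e-12:
            return None
        M[col], M[piv] = M[piv], M[col]
        for t in range(n):
            if t != col:
                factor = M[t][col] / M[col][col]
                M[t] = [a - factor * c for a, c in zip(M[t], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def best_kkt_candidate(A):
    """Try every m0 and return (objective, m0, r) minimizing r^T A r among feasible candidates.

    A is the (2m+1) x (2m+1) matrix (1-p) W_1 + p W_rho0; list position t stands for j = t - m.
    """
    size = len(A)
    m = size // 2
    best = None
    for m0 in range(m + 1):
        lo, hi = m - m0, m + m0                  # positions of j = -m0 .. m0
        free = list(range(lo, hi + 1))
        # b_j = sum_{k < -m0} A(j, k) - sum_{k > m0} A(j, k), as in (4.23).
        b = [sum(A[j][k] for k in range(lo)) - sum(A[j][k] for k in range(hi + 1, size))
             for j in free]
        sub = [[A[j][k] for k in free] for j in free]
        s = solve_linear(sub, b)
        if s is None or any(abs(v) > 1 + 1e-9 for v in s):
            continue                             # singular or infeasible: skip this m0
        r = [-1.0] * lo + s + [1.0] * (size - hi - 1)
        value = sum(A[j][k] * r[j] * r[k] for j in range(size) for k in range(size))
        if best is None or value < best[0]:
            best = (value, m0, r)
    return best


# Toy 3x3 example: mass only on "opposite" cells, so r = (-1, 0, 1) should win.
toy = best_kkt_candidate([[0, 0, 1], [0, 1, 0], [1, 0, 0]])
```

Since the quadratic form need not be convex, the sketch compares objective values across all feasible candidates rather than trusting any single KKT point.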

4.5.1  Evading  Singularity 

The above discussion suggests a $\mathrm{poly}(1/\epsilon)$ time algorithm for computing (4.19) exactly. There are two problems we need to circumvent, however. The first problem is that, algorithmically, we cannot compute the values $W_\rho(j,k)$ (or even the endpoints of the intervals in $\mathcal{J}_\epsilon$) exactly. The more challenging problem is that the square system (4.23) may be singular, in which case it may produce infinitely many solutions that would need to be tried. As we will see, once we take care of the latter problem, the former will follow.

Let us write the square system (4.23) more compactly as

\[
\big( (1-p) M_{1,m_0} + p M_{\rho_0,m_0} \big)\, s = b, \tag{4.24}
\]

where $M_{\rho,m_0}$ represents the square submatrix of $W_\rho$ corresponding to indices $-m_0, \ldots, m_0$, and $s$ represents the truncation of the vector $r$ to these indices. We may assume here that $m_0 \ge 1$, since there is nothing to solve for if $m_0 = 0$ (note that $r^*_0$ must be 0 by oddness). Write $M_{\rho_0,m_0,p} = (1-p) M_{1,m_0} + p M_{\rho_0,m_0}$.

We are concerned about the possibility that $\det(M_{\rho_0,m_0,p}) = 0$. More generally, we are concerned if the condition number $\kappa(M_{\rho_0,m_0,p})$ is very large, since in this case our inability to calculate the matrices precisely would lead to very inaccurate solutions to (4.24).

Since the matrix $M_{\rho_0,m_0,p}$ is symmetric, its condition number is

\[
\kappa(M_{\rho_0,m_0,p}) = |\lambda_{\max}(M_{\rho_0,m_0,p})| \,/\, |\lambda_{\min}(M_{\rho_0,m_0,p})|,
\]

where $\lambda_{\max}$ and $\lambda_{\min}$ denote the largest and smallest eigenvalues in absolute value. Since each $M_{\rho,m_0}$ is a submatrix of the stochastic matrix $W_\rho$, its maximum eigenvalue is at most 1; hence we need only worry about the smallest eigenvalue of $M_{\rho_0,m_0,p}$. Since $M_{1,m_0}$ is a multiple of the identity matrix, it can be simultaneously diagonalized with $M_{\rho_0,m_0}$, and hence the eigenvalues of $M_{\rho_0,m_0,p}$ are precisely

\[
\{ (1-p) + p\,\lambda_{\rho_0,m_0}(j) \}_{j},
\]

where the $\lambda_{\rho_0,m_0}(j)$'s are the eigenvalues of $M_{\rho_0,m_0}$. It is easy to see that for any particular $\lambda_{\rho_0,m_0}(j)$, the set of $p$'s for which $(1-p) + p\,\lambda_{\rho_0,m_0}(j)$ is in the range $(-\delta, \delta)$ is an interval of width at most $2\delta$. Hence we deduce the following:

Proposition 4.5.3. For each $\rho_0$, the set

\[
\mathcal{B}_{\rho_0} := \bigcup_{1 \le m_0 \le m} \{ p : \kappa(M_{\rho_0,m_0,p}) \ge 1/\delta \}
\]

is a collection of at most $m \cdot (2m+1) = O(1/\epsilon^4)$ intervals of width at most $2\delta$ each.

Our trick now will be to give up on these 'bad' $p$'s; or rather, the 'bad' $c$-values with which they are associated. Recalling the relationship $p = 2c/(1-\rho_0) \Leftrightarrow c = (1-\rho_0)p/2$, we have that

\[
C := \bigcup_{\epsilon^7\text{-discretized } \rho_0} \{ (1-\rho_0)p/2 : p \in \mathcal{B}_{\rho_0} \}
\]

is a collection of at most $O(1/\epsilon^{11})$ intervals of width at most $2\delta$ each. And, whenever $c \notin C$, we are assured that the square system (4.24) has a matrix with condition number at most $1/\delta$.

We now set $\delta = \epsilon^{15}$ and use the following algorithm for estimating $S(c)$. Given $c$, we try to estimate $S(c')$ for all values $c' = c + t\epsilon^{14}$, for $t$ an integer with $|t| \le 1/\epsilon^{12}$. If we manage to succeed for some $c'$, then the resulting estimate for $S(c')$ will also be a $\pm O(\epsilon)$ estimate for $S(c)$, since $|c' - c| \le \epsilon^2$ (and see the proof of Corollary 4.4.4 regarding the continuity of $S$). There are at most $O(1/\epsilon^{11})$ 'bad' intervals comprising $C$, and each has width at most $2\delta$. Since $2\delta \ll \epsilon^{14}$, each such interval contains at most one possible $c'$; but there are $2/\epsilon^{12} + 1 \gg O(1/\epsilon^{11})$ possible $c'$, and hence at least one choice must fall outside $C$. Hence we will succeed for at least one $c'$.
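The pigeonhole step can be illustrated with a toy search over shifted values of $c$. This is only a sketch with small, made-up parameters: in the text the step is $\epsilon^{14}$ and the number of shifts is about $2/\epsilon^{12}$, and the bad intervals are not known explicitly but detected through ill-conditioning. The function name `first_good_shift` and the explicit list of bad intervals are assumptions for illustration.

```python
from fractions import Fraction


def first_good_shift(c, step, max_t, bad_intervals):
    """Return the first c' = c + t*step with |t| <= max_t outside every bad interval.

    Candidates are tried in order of increasing |t|. If every bad interval is
    narrower than `step` and there are fewer bad intervals than candidates,
    the pigeonhole principle guarantees a non-None result.
    """
    for t in sorted(range(-max_t, max_t + 1), key=abs):
        shifted = c + t * step
        if not any(lo <= shifted <= hi for lo, hi in bad_intervals):
            return shifted
    return None


# Toy instance: 5 candidates, 4 narrow bad intervals covering t = 0, -1, 1, -2.
c = Fraction(3, 4)
step = Fraction(1, 100)
half_width = Fraction(1, 400)
bad = [(c + t * step - half_width, c + t * step + half_width) for t in (0, -1, 1, -2)]
good_c = first_good_shift(c, step, 2, bad)
```

Exact rational arithmetic is used here only so that the toy example has no floating-point ambiguity at interval endpoints.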

4.6 On S(c) and Running Times
4.6.1 On S(c)

As we have shown, $S(c)$ can be computed to within $\pm\epsilon$ in time $\mathrm{poly}(1/\epsilon)$; we believe this result justifies our claim that $S(c)$ is 'explicit'. A reasonable way to understand the notion of 'explicitness' would be with respect to the 'bit model' of Braverman and Cook [25]; in that setting, our $\mathrm{poly}(1/\epsilon)$ time algorithm would correspond to a fairly liberal notion of 'explicit', with a $\mathrm{polylog}(1/\epsilon)$ time algorithm corresponding to a fairly demanding notion of 'explicit'. The latter notion is the level of explicitness one has for, e.g., $\arccos(1-2c)/\pi$. On the other hand, some less explicit-looking bounds have been given for related problems; for example, Haagerup's bound [72] for the complex Grothendieck constant is $8/\pi(k_0+1)$, where $k_0$ is the unique solution of the equation

\[
\frac{\pi(k+1)}{8k} = \int_0^{\pi/2} \frac{\cos^2 t}{\sqrt{1 - k^2 \sin^2 t}}\, dt
\]

in the interval $[0,1]$. This value can surely be computed to within $\pm\epsilon$ in time $\mathrm{poly}(1/\epsilon)$; it may well also be computable in time $\mathrm{polylog}(1/\epsilon)$ but this is, at least, not immediately obvious.

We in fact used the algorithm behind Theorem 4.1.12 to approximate $S(c)$ for the values .505, .510, .515, ..., .840 (with the values $S(.5) = .5$ and $S(c) = \arccos(1-2c)/\pi$ for $c \ge .844$ being already known). The values we found are given in the table in Section 4.15. We were not completely formal about the approximation process and thus the results in Section 4.15 should not be considered rigorous. In particular, the approximations of the matrices $W_\rho$ were done numerically in Matlab; also, the problem of singularity discussed in section 4.5.1 did not seem to arise and so we disregarded it. We can also report that the best rounding functions $r$ arising in the algorithm were very close to being $s$-linear, in all cases; they became only slightly rounded near $\pm s$ (convex near $-s$, concave near $s$).
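As a small illustration of the 'explicitness' discussion earlier in this subsection, the equation defining Haagerup's $k_0$ can be solved numerically by elementary means. The sketch below assumes Simpson's rule for the integral and plain bisection on $[0,1]$ (after multiplying the equation through by $k$); it makes no claim about achieving $\mathrm{polylog}(1/\epsilon)$ running time, and the function names are our own.

```python
from math import cos, sin, sqrt, pi


def haagerup_integral(k, steps=2000):
    """Simpson's rule for the integral of cos^2(t) / sqrt(1 - k^2 sin^2(t)) over [0, pi/2].

    Intended for 0 <= k < 1, where the integrand is smooth and bounded.
    """
    h = (pi / 2) / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * cos(t) ** 2 / sqrt(1.0 - (k * sin(t)) ** 2)
    return total * h / 3


def haagerup_k0(tol=1e-12):
    """Bisection on g(k) = k * I(k) - pi (k + 1) / 8, which is negative at 0 and positive near 1."""
    def g(k):
        return k * haagerup_integral(k) - pi * (k + 1) / 8

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


k0 = haagerup_k0()
bound = 8 / (pi * (k0 + 1))   # the quantity 8 / (pi (k0 + 1)) quoted in the text
```

Bisection never evaluates the integrand at $k = 1$ exactly, so the endpoint singularity of the denominator is not an issue in this sketch.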

4.6.2  On  the  Running  Time  of  the  Rounding  Algorithm 

As shown in Corollary 4.4.4, our MAX CUT rounding algorithm is efficient (polynomial) in terms of its dependence on $n$, the number of vertices; indeed, the running time is dominated by the time for SDP. To get a cut that is provably within $\epsilon$ of $S(\mathrm{Opt}(G))$, however, our algorithm's dependence on $\epsilon$ is exponential, $2^{\mathrm{poly}(1/\epsilon)}$. As we will discuss in Section 4.13, all known RPR$^2$ algorithms have at least some $\epsilon$ dependence as well. This dependence is at least $\mathrm{poly}(1/\epsilon)$, from converting expectation results to high probability results; in some papers, it is exponential (as in the derandomized Goemans-Williamson algorithm from [46]).

In practice, we feel this issue is not very important. As mentioned in the previous section, we observed that using RPR$^2$ with $s$-linear rounding functions (as Feige and Langberg suggested) seems nearly optimal. In particular, it seems to achieve cuts that are within about $10^{-4}$ of $S(c)$, across all values of $c$. Further, one can precompute a table of which value of '$s$' to use for 'each' possible value of $c$ (suitably discretized), and the algorithm knows what $c$ is after solving the SDP. Thus in practice one can achieve within $10^{-4}$ of $S(c)$ with no real running time overhead. If error smaller than $10^{-4}$ is desired, it seems one can perform a local search for a better rounding function, starting from the appropriate $s$-linear function and modifying it slightly near $\pm s$.
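For readers who wish to experiment with $s$-linear rounding functions, the following Monte Carlo sketch estimates $\mathrm{val}_{\mathcal{G}_P}(r)$ for a finitely supported $P$. It assumes that an $s$-linear function has the form $r(x) = x/s$ for $|x| \le s$ and $\mathrm{sgn}(x)$ otherwise; the sample sizes, seed, and function names are illustrative choices and not part of the text.

```python
import random
from math import sqrt


def s_linear(s):
    """Assumed form of an s-linear rounding function: x/s clipped to [-1, 1]."""
    return lambda x: max(-1.0, min(1.0, x / s))


def estimate_val(r, dist, samples=50_000, seed=0):
    """Monte Carlo estimate of E_{rho ~ P} E[1/2 - 1/2 r(x) r(y)], (x, y) rho-correlated Gaussians.

    dist is a list of (rho, weight) pairs describing P; weights should sum to 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for rho, weight in dist:
        acc = 0.0
        for _ in range(samples):
            x = rng.gauss(0.0, 1.0)
            z = rng.gauss(0.0, 1.0)
            y = rho * x + sqrt(1.0 - rho * rho) * z   # y is rho-correlated with x
            acc += 0.5 - 0.5 * r(x) * r(y)
        total += weight * acc / samples
    return total


# With s tiny the function is essentially sgn, whose value at correlation rho
# is arccos(rho) / pi; for rho = -0.5 that is 2/3.
near_sign_value = estimate_val(s_linear(1e-9), [(-0.5, 1.0)])
```

Comparing such estimates across a grid of $s$ values is one way to build the lookup table mentioned above, although a deterministic quadrature would be preferable when high accuracy is needed.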

Finally, given our $\mathrm{poly}(1/\epsilon)$ time algorithm for approximating $S(c)$ to within $\pm\epsilon$, we believe that our rounding algorithm should also be able to have this improved dependence. Since this is not the main focus of our work, we will only briefly describe the technicalities that would need to be overcome. Given an embedded graph $G$ with $\rho$-distribution $P$, the idea would not be to try to solve the Karush-Kuhn-Tucker conditions for $\mathcal{G}_P$: since in general we have no promise that the optimal rounding function for $\mathcal{G}_P$ is increasing, we wouldn't be able to effectively try all possibilities for where it is $\pm 1$. Instead, one might simply try to use all of the rounding functions constructed in the determination of $S(c)$. This seems as though it should work: the proof of Theorem 4.4.2 using the Minimax Theorem seems to imply that a convex combination of the optimal rounding functions for $(1,\rho_0)$-distributions will achieve at least $S(c)$ for $\mathcal{G}_P$.

Unfortunately, several technical problems crop up. First, the Minimax proof only implies that 'nearly' $(1,\rho_0)$-distributions are the worst case, and it is unclear if we can effectively enumerate these, since the weight to distribute to the three points is not completely determined by $c$. Second, even if we circumvent this problem, the Minimax theorem only implies that some convex combination of all the optimal rounding functions for $(1,\rho_0)$-distributions will be good for $\mathcal{G}_P$; however, our algorithm for computing $S(c)$ only finds the increasing ones. This problem too might be circumventable if one could prove strict increase in Borell's rearrangement inequality assuming the function is not already monotone. Such an 'equality condition' result is probably true, but is currently unknown. Finally, even if both of these issues were fixed, we still have the problem that the Karush-Kuhn-Tucker conditions might be a singular system and thus have multiple (and possibly very many) solutions, all of which theoretically might need to be combined by the 'Function Player'.


4.7 Dictator-vs.-quasirandom Tests

In this section we discuss Dictator Tests and give the definitions necessary for our 'dictator-vs.-quasirandom' tests. The subsequent two sections are devoted to the proof that $\mathrm{Gap}_{\mathrm{Test}}(c) = S(c)$.

We  begin  with  an  essential  observation:  2-query  Dictator  Tests  are  nothing  more  than 
embedded  graphs  (see  Definition  4.2.1),  with  the  vertex  set  being  further  restricted  to  lie 
within  the  discrete  cube.  To  make  the  connection  clearer,  we  treat  the  discrete  cube  as 
lying  on  the  unit  sphere: 

Definition 4.7.1. We write $\{-1,1\}^n = \{-\tfrac{1}{\sqrt{n}}, \tfrac{1}{\sqrt{n}}\}^n$ for the discrete cube, since it is convenient to have $\{-1,1\}^n \subset S^{n-1}$.

Definition 4.1.7 defines a 2-query, $\neq$-based Long Code test to be a probability distribution on pairs $(x,y) \in \{-1,1\}^n \times \{-1,1\}^n$. Since we think of the Long Code test as testing $f(x) \neq f(y)$ and since $\neq$ is symmetric, there is no loss in generality if we insist that the probability distribution be symmetric in $x$ and $y$. But such a symmetric distribution on $\{-1,1\}^n \times \{-1,1\}^n$ is identical to a weighted undirected graph $G$ on $\{-1,1\}^n$, with self-loops allowed. Note that this is an embedded graph, with the additional property that the vertex set is (a subset of) $\{-1,1\}^n$. Further, if $f : \{-1,1\}^n \to \{-1,1\}$ is the function being tested, then $\tfrac12 - \tfrac12 f(x) f(y)$ is 1 if $f(x) \neq f(y)$ and 0 if $f(x) = f(y)$. Hence the probability that $f$ passes the test is just $\mathrm{val}_G(f)$. Extending this definition to functions $f : \{-1,1\}^n \to [-1,1]$, we have the following:

Definition 4.7.2. A dictator-vs.-quasirandom test for $n$-bit functions $f : \{-1,1\}^n \to [-1,1]$ is an embedded graph $T$ whose vertex set is $\{-1,1\}^n$. The value of the test on $f$ is $\mathrm{val}_T(f)$, and this is sometimes referred to as the probability that $T$ passes/accepts $f$.

Our notion of the 'completeness' of a dictator-vs.-quasirandom test is essentially as in Definition 4.1.8: the least probability with which one of the Dictators passes:

Definition 4.7.3. The $i$th Dictator function $\chi_i : \{-1,1\}^n \to \{-1,1\}$ is defined by $\chi_i(x) = \sqrt{n} \cdot x_i$.

Definition 4.7.4. The completeness of an $n$-bit dictator-vs.-quasirandom test $T$ is

\[
\mathrm{Completeness}(T) = \min_{i \in [n]} \{ \mathrm{val}_T(\chi_i) \}.
\]

The average of the probabilities with which Dictators pass a test $T$ is precisely its spread:

Proposition 4.7.5. Given an $n$-bit dictator-vs.-quasirandom test $T = (\{-1,1\}^n, E)$, we have

\[
\mathrm{Spread}(T) = \mathop{\mathrm{avg}}_{i \in [n]} \{ \mathrm{val}_T(\chi_i) \}.
\]

Hence $\mathrm{Spread}(T) \ge \mathrm{Completeness}(T)$.

Proof.

\[
\mathrm{Spread}(T) = \mathop{\mathbb{E}}_{(x,y) \sim E} \left[ \tfrac12 - \tfrac12\, x \cdot y \right] = \mathop{\mathbb{E}}_{(x,y) \sim E} \left[ \tfrac12 - \tfrac12 \sum_{i=1}^n x_i y_i \right] = \frac1n \sum_{i=1}^n \mathop{\mathbb{E}}_{(x,y) \sim E} \left[ \tfrac12 - \tfrac12\, n\, x_i y_i \right] = \mathop{\mathrm{avg}}_{i \in [n]} \{ \mathrm{val}_T(\chi_i) \}.
\]

□

As discussed in section 4.1.4 we use a weakened soundness notion for dictator-vs.-quasirandom tests; specifically, these tests only need to reject functions that are sufficiently 'quasirandom'. This soundness condition allows us to get large completeness/soundness gaps despite using only 2 queries. The notion of being 'quasirandom' is, for all intents and purposes, the same as the notion of having small 'low-degree influences' introduced in [99] and used in previous papers on UNIQUE-GAMES-hardness. We will make a very slightly different definition because we feel it is more natural. To make this definition we need to recall the basics of Fourier analysis of Boolean functions.

Analogous to the Hermite analysis described in section 4.3.3, the space of functions $L^2(\{-1,1\}^n)$ under the uniform distribution has a complete orthonormal basis given by the monomials $(\chi_S)_{S \subseteq [n]}$:

\[
\chi_S(x) = \prod_{i \in S} (\sqrt{n} \cdot x_i).
\]

One can uniquely express any function $f : \{-1,1\}^n \to \mathbb{R}$ via its Fourier expansion,

\[
f = \sum_{S \subseteq [n]} \hat{f}(S)\, \chi_S.
\]

We  now  introduce  quasirandom  functions: 

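Definition 4.7.6. For $0 \le \epsilon, \delta \le 1$, we say a function $f : \{-1,1\}^n \to [-1,1]$ is $(\epsilon,\delta)$-quasirandom if for each $i \in [n]$,

\[
\mathrm{Inf}_i^{(1-\delta)}(f) \le \epsilon,
\]

where we define the $(1-\delta)$-attenuated influence of $i$ on $f$ to be

\[
\mathrm{Inf}_i^{(1-\delta)}(f) = \sum_{\substack{S \subseteq [n] \\ S \ni i}} (1-\delta)^{|S|-1}\, \hat{f}(S)^2.
\]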

Note that this definition becomes stricter when $\epsilon$ or $\delta$ decreases; we think of functions as being 'more quasirandom' when $\delta$ and (especially) $\epsilon$ are small. As an example, Dictator functions $\chi_i$ are the antithesis of being quasirandom; in particular, if $\epsilon < 1$ then $\chi_i$ is not $(\epsilon,\delta)$-quasirandom even for $\delta = 1$ (see footnote 8). On the other hand, the Majority function is extremely quasirandom; specifically, $(O(\tfrac{1}{\sqrt{n}}), 0)$-quasirandom. We have chosen the name quasirandom based on the 'Invariance Principle' from [116], which essentially states that if $f : \{-1,1\}^n \to [-1,1]$ is very quasirandom, then the distribution of

\[
\sum_{S \subseteq [n]} \hat{f}(S) \prod_{i \in S} X_i
\]

is nearly unchanged whether one takes the $X_i$'s to be independent $\pm 1$ bits or independent $N(0,1)$ Gaussians.
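The attenuated influence of Definition 4.7.6 is easy to compute by brute force for small $n$. The sketch below is illustrative only: it works with unnormalized $\pm 1$ bits (so each coordinate here corresponds to $\sqrt{n}\cdot x_i$ in the normalization of Definition 4.7.1), enumerates all $2^n$ points, and uses function names of our own choosing. Python evaluates `0 ** 0` as 1, which matches the convention $0^0 = 1$ used for $\delta = 1$.

```python
from itertools import combinations, product


def fourier_coefficients(f, n):
    """Return {S: hat f(S)} with hat f(S) = E_x[f(x) * prod_{i in S} x_i], x uniform in {-1,1}^n."""
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for S in combinations(range(n), size):
            total = 0.0
            for x in points:
                chi = 1
                for i in S:
                    chi *= x[i]
                total += f(x) * chi
            coeffs[S] = total / len(points)
    return coeffs


def attenuated_influence(coeffs, i, delta):
    """Inf_i^{(1-delta)}(f) = sum over S containing i of (1-delta)^(|S|-1) * hat f(S)^2."""
    return sum((1 - delta) ** (len(S) - 1) * c * c for S, c in coeffs.items() if i in S)


def dictator(x):
    return x[0]


def majority(x):
    return 1 if sum(x) > 0 else -1


dict_coeffs = fourier_coefficients(dictator, 3)
maj_coeffs = fourier_coefficients(majority, 3)
dictator_inf = attenuated_influence(dict_coeffs, 0, 1)     # stays 1 even at delta = 1
majority_inf = attenuated_influence(maj_coeffs, 0, 0.5)    # 1/4 + (1/2)^2 * 1/4
```

On three bits, Majority is of course not yet very quasirandom; the $O(1/\sqrt{n})$ behavior quoted above only becomes visible as $n$ grows, which this exponential-time sketch is not designed to reach.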

Having defined quasirandom functions, we give the soundness notion for our tests:

Definition 4.7.7. The $(\epsilon,\delta)$-soundness of a dictator-vs.-quasirandom test $T$ for functions $\{-1,1\}^n \to [-1,1]$ is

\[
\mathrm{Soundness}_{\epsilon,\delta}(T) = \max\{ \mathrm{val}_T(f) : f \text{ is } (\epsilon,\delta)\text{-quasirandom} \}.
\]

Given this definition, the most natural Property Testing question to ask is how far apart completeness and soundness can be for dictator-vs.-quasirandom tests:

Definition 4.7.8. We call the pair $(c,s)$ a dictator-vs.-quasirandom test $(\epsilon,\delta)$-gap if for all sufficiently large $n$, there is a dictator-vs.-quasirandom test $T^{(n)}$ for functions $f : \{-1,1\}^n \to [-1,1]$ with $\mathrm{Completeness}(T^{(n)}) \ge c$ and $\mathrm{Soundness}_{\epsilon,\delta}(T^{(n)}) \le s$. We call the pair $(c,s)$ simply a dictator-vs.-quasirandom test gap if $\forall \eta > 0$, $\exists \epsilon, \delta > 0$ such that $(c, s+\eta)$ is a dictator-vs.-quasirandom test $(\epsilon,\delta)$-gap.

Definition 4.7.9. The dictator-vs.-quasirandom gap curve is the function $\mathrm{Gap}_{\mathrm{Test}} : [\tfrac12, 1] \to [\tfrac12, 1]$ defined by

\[
\mathrm{Gap}_{\mathrm{Test}}(c) = \min\{ s : (c,s) \text{ is a dictator-vs.-quasirandom test gap} \}.
\]

(It is immediate from the definitions that this min is achieved; i.e., we needn't write inf.)

In the next section we will show that $\mathrm{Gap}_{\mathrm{Test}}(c) \le S(c)$; substituting this into these theorems yields our results from section 4.1.6; the subsequent section will be devoted to the inequality $\mathrm{Gap}_{\mathrm{Test}}(c) \ge S(c)$, whose proof completes the result $\mathrm{Gap}_{\mathrm{Test}}(c) = S(c)$. Although the inequality $\mathrm{Gap}_{\mathrm{Test}}(c) \ge \mathrm{Gap}_{\mathrm{SDP}}(c)$ was already implicitly proved in [107], we will give an alternate direct proof which clarifies the connection between SDP rounding algorithms and dictator-vs.-quasirandom testing. Finally, in the last section we will connect dictator-vs.-quasirandom tests with the SDP-hardness constructions in [3, 4, 86].


4.8 GapTest(c) ≤ S(c): Invariance Principle

To upper-bound $\mathrm{Gap}_{\mathrm{Test}}(c)$, we need to determine dictator-vs.-quasirandom tests with completeness at least $c$ for which all quasirandom functions pass with small probability. Studying just how small this soundness can be is very similar to searching for the largest possible SDP gap, discussed in section 4.2. For example, given a particular test $T$ on $\{-1,1\}^n$ with $\mathrm{Completeness}(T) \ge c$ and $\mathrm{Soundness}_{\epsilon,\delta}(T) \le s$, one can symmetrize it with respect to all $2^n n!$ symmetries of $\{-1,1\}^n$, forming $T'$. Then one still has $\mathrm{Completeness}(T') \ge c$ and $\mathrm{Soundness}_{\epsilon,\delta}(T') \le s$, and furthermore $T'$ has the property that the probability of choosing a pair $(x,y)$ depends only on its Hamming distance; i.e., only on $\langle x,y \rangle$. Just as we switched from graphs which insisted on $\langle x,y \rangle$ being precisely $\rho$ to the analytically easier $\mathcal{G}_P$, it is natural to switch to the version of symmetrized tests with independence across coordinates:

8. We take $0^0 = 1$ in the definition.

Definition 4.8.1. We define the noise sensitivity mixture test $\mathcal{T}_P^{(n)}$ on $\{-1,1\}^n$ by analogy with Gaussian mixture graphs. In particular we define $(x,y)$ to be $\rho$-correlated $n$-bit strings if $x$ is drawn uniformly from $\{-1,1\}^n$ and $y$ is formed by taking $y_i = x_i$ with probability $\tfrac12 + \tfrac12\rho$ and $y_i = -x_i$ with probability $\tfrac12 - \tfrac12\rho$, independently across $i$.

We remark that a $\rho$-correlated pair $(x,y)$ has $\langle x,y \rangle$ tightly concentrated around $\rho$, and that further:

Fact 4.8.2. $\mathrm{Completeness}(\mathcal{T}_P^{(n)}) = \mathrm{Spread}(P) = \mathbb{E}_{\rho \sim P}[\tfrac12 - \tfrac12\rho]$.
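Definition 4.8.1 and Fact 4.8.2 can be checked empirically with a short simulation. The following is a sketch under our own conventions: bits are unnormalized $\pm 1$ values (so the inner product in the text corresponds to the average of $x_i y_i$ here), the distribution $P$ is a hypothetical two-point example, and the function names are illustrative.

```python
import random


def correlated_pair(n, rho, rng):
    """Sample (x, y): x uniform in {-1,1}^n, and y_i = x_i with probability 1/2 + rho/2."""
    x = [rng.choice((-1, 1)) for _ in range(n)]
    y = [xi if rng.random() < 0.5 + 0.5 * rho else -xi for xi in x]
    return x, y


def dictator_pass_rate(dist, n=5, i=0, samples=20_000, seed=0):
    """Empirical probability that x_i != y_i when rho ~ P and (x, y) is rho-correlated.

    dist is a list of (rho, weight) pairs describing P.
    """
    rng = random.Random(seed)
    rhos = [rho for rho, _ in dist]
    weights = [w for _, w in dist]
    passed = 0
    for _ in range(samples):
        rho = rng.choices(rhos, weights)[0]
        x, y = correlated_pair(n, rho, rng)
        passed += x[i] != y[i]
    return passed / samples


P = [(1.0, 0.2), (-0.5, 0.8)]                           # hypothetical (1, rho0)-distribution
expected = sum(w * (0.5 - 0.5 * rho) for rho, w in P)   # E[1/2 - rho/2]
rate = dictator_pass_rate(P)
```

The same sampler can be used to observe the concentration of the normalized inner product around $\rho$ by drawing a single pair with large $n$.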

Also, given $f : \{-1,1\}^n \to \mathbb{R}$ we use the notation

\[
\mathbb{S}_\rho(f) = \mathop{\mathbb{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ n\text{-bit strings}}} [ f(x) f(y) ].
\]

The reader is warned that we use the notation $\mathbb{S}_\rho(f)$ for both $f : \{-1,1\}^n \to \mathbb{R}$ and $f \in L^2(\mathbb{R}^n)$ with the Gaussian distribution. For more on noise sensitivity tests, see [99].

Having decided that the best dictator-vs.-quasirandom gaps will occur essentially with noise sensitivity mixture tests, the ideas from section 4.2.3 again apply. The Hermite and Fourier formulas for noise stability are the same and we again conclude that the optimal mixture should come from a $(1,\rho_0)$-distribution. This provides an explanation for why such tests were useful in [102].

Finally, to upper-bound the value of quasirandom functions on noise sensitivity $(1,\rho_0)$-mixture tests, we use the Invariance Principle of [116] (which is also stated in Section 3.2.1 without explicitly giving the error bound) to reduce to the analysis of the Max Cut in Gaussian mixture graphs. Then Theorem 4.3.4 can be used to get an upper bound of $S(c)$. More precisely, we prove the following theorem:

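Theorem 4.8.3. Let $P$ be any $(1,\rho_0)$-distribution and let $T$ denote the dictator-vs.-quasirandom test $\mathcal{T}_P^{(n)}$. Then for any $\tau > 0$,

\[
\mathrm{Soundness}_{\tau,\,\Omega(1/\log(1/\tau))}(T) \;\le\; \sup_{\substack{r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r) \;+\; O(\log(1/\tau)^{-1/8}).
\]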

Before  proving  Theorem  4.8.3,  let  us  see  how  it  implies  the  desired  result: 

Corollary 4.8.4. $\mathrm{Gap}_{\mathrm{Test}}(c) \le S(c)$.

Proof. Let $P$ be the $(1,\rho_0)$-distribution with mean $1-2c$ achieving the minimum in the definition of $S(c)$ (or rather, in Corollary 4.4.14). Writing $T = \mathcal{T}_P^{(n)}$, we have $\mathrm{Completeness}(T) = c$ by Fact 4.8.2. Now by definition,

\[
\sup_{\substack{r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r)
\]

is precisely $S(c)$. Hence Theorem 4.8.3 implies that the $(\epsilon,\delta)$-soundness of $T$ can be made at most $S(c)$ plus an arbitrarily small amount, by taking $\epsilon$ and $\delta$ sufficiently small. This establishes $\mathrm{Gap}_{\mathrm{Test}}(c) \le S(c)$. □


4.8.1  Proof  of  Theorem  4.8.3 

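The proof is an extension of the proof of the Majority Is Stablest theorem from [116]. Let $P$, $T$, and $\tau$ be as in the statement of the theorem, and let $f : \{-1,1\}^n \to [-1,1]$ be a $(\tau, \Omega(1/\log(1/\tau)))$-quasirandom function. We need to show that

\[
\mathrm{val}_T(f) \;=\; \mathop{\mathbb{E}}_{\rho \sim P} \;\; \mathop{\mathbb{E}}_{\substack{(x,y)\ \rho\text{-corr'd} \\ n\text{-bit strings}}} \left[ \tfrac12 - \tfrac12 f(x) f(y) \right] \;=\; \tfrac12 - \tfrac12 \mathop{\mathbb{E}}_{\rho \sim P} [ \mathbb{S}_\rho(f) ]
\]

is, up to an additive $O(\log(1/\tau)^{-1/8})$, at most

\[
\sup_{\substack{r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_P}(r) \;=\; \sup_{\substack{r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \left( \tfrac12 - \tfrac12 \mathop{\mathbb{E}}_{\rho \sim P} [ \mathbb{S}_\rho(r) ] \right).
\]

Equivalently, we must show

\[
\mathop{\mathbb{E}}_{\rho \sim P} [ \mathbb{S}_\rho(f) ] \;\ge\; \inf_{\substack{r:\mathbb{R}\to[-1,1] \\ \text{increasing, odd}}} \mathop{\mathbb{E}}_{\rho \sim P} [ \mathbb{S}_\rho(r) ] \;-\; O(\log(1/\tau)^{-1/8}). \tag{4.25}
\]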


Let us write $p$ for the weight of $P$ on $\rho_0$. Then the left side of (4.25) is

\[
(1-p)\,\mathbf{E}[f^2] + p\,\mathbb{S}_{\rho_0}(f).
\]

As in the proof of Theorem 4.3.4, this quantity can only decrease if we replace $f$ by $f^{\mathrm{odd}}$, in which case it becomes

\[
(1-p)\,\mathbf{E}[f^2] - p\,\mathbb{S}_{-\rho_0}(f), \tag{4.26}
\]

analogous to (4.8). (Note that a similar formula will arise on the right side of (4.25), since the $r$'s are odd.) Since $f^{\mathrm{odd}}$ has the same Fourier expansion as $f$ except with the even-degree terms dropped, we have that $\mathrm{Inf}_i^{(1-\delta)}(f^{\mathrm{odd}}) \le \mathrm{Inf}_i^{(1-\delta)}(f)$, and hence $f = f^{\mathrm{odd}}$ is still $(\epsilon,\delta)$-quasirandom.
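The effect of passing to the odd part can be checked by brute force on a small example. The sketch below is only an illustration under our own assumptions: the test function `f_or`, the weight `p = 0.7` and the correlation `rho0 = -0.6` are hypothetical choices, and noise stability is computed from the standard Fourier formula $\mathbb{S}_\rho(f) = \sum_S \rho^{|S|}\hat f(S)^2$.

```python
from itertools import combinations, product
from math import prod


def fourier(f, n):
    """Brute-force Fourier coefficients of f on {-1,1}^n, keyed by the index set S."""
    pts = list(product((-1, 1), repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for S in combinations(range(n), k):
            coeffs[S] = sum(f(x) * prod(x[i] for i in S) for x in pts) / len(pts)
    return coeffs


def stability(coeffs, rho):
    """Noise stability S_rho(f) = sum_S rho^|S| * fhat(S)^2."""
    return sum(rho ** len(S) * c * c for S, c in coeffs.items())


def odd_part(coeffs):
    """Drop the even-degree Fourier coefficients."""
    return {S: (c if len(S) % 2 == 1 else 0.0) for S, c in coeffs.items()}


def mixture(coeffs, p, rho0):
    """(1 - p) * E[f^2] + p * S_rho0(f); note E[f^2] = S_1(f)."""
    return (1 - p) * stability(coeffs, 1.0) + p * stability(coeffs, rho0)


def f_or(x):
    """A non-odd example function (hypothetical choice): OR in +/-1 notation."""
    return 1 if any(v == 1 for v in x) else -1


coeffs = fourier(f_or, 3)
odd = odd_part(coeffs)
print(mixture(coeffs, 0.7, -0.6), mixture(odd, 0.7, -0.6))
```

For an odd function every surviving coefficient has odd degree, so $\mathbb{S}_{\rho_0} = -\mathbb{S}_{-\rho_0}$, which is the sign change between the two displayed expressions.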

We now set $\gamma = O(\cdots)$ and distinguish the two cases $\rho_0 \le -1 + 3\gamma$ and $\rho_0 > -1 + 3\gamma$:
Case 1: $\rho_0 \le -1 + 3\gamma$. In this case we use $\mathbb{S}_{-\rho_0}(f) \le \mathbb{S}_1(f) = \mathbf{E}[f^2]$ to deduce that (4.26) is at least $1 - 2p$. On the other hand, by taking $r = \mathrm{sgn}$ (which is increasing and odd), we conclude that the term on the right side of (4.25) satisfies

\[
\inf_{\substack{r:\mathbb{R}\to[-1,1]\\ \text{increasing, odd}}} \mathop{\mathbf{E}}_{\rho\sim P}\bigl[\mathbb{S}_\rho(r)\bigr] \;\le\; (1-p)\,\mathbf{E}[\mathrm{sgn}^2] - p\,\mathbb{S}_{-\rho_0}(\mathrm{sgn}) \;=\; (1-p) - p\bigl(1 - O(\sqrt{\gamma})\bigr) \;=\; 1 - 2p + O(\sqrt{\gamma}),
\]

where we used the estimate $\mathbb{S}_{1-\delta}(\mathrm{sgn}) = 1 - O(\sqrt{\delta})$. Since $\sqrt{\gamma} \ll O(\log(1/\tau)^{-1/8})$, the proof of (4.25) in this case is complete.
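The estimate for sgn used in Case 1 can be checked numerically. The sketch below assumes Sheppard's formula, $\mathbf{E}[\mathrm{sgn}(X)\,\mathrm{sgn}(Y)] = \tfrac{2}{\pi}\arcsin\rho$ for $\rho$-correlated standard Gaussians, as the closed form for the Gaussian noise stability of sgn; the function name is ours.

```python
import math


def sgn_stability(rho):
    """Sheppard's formula: E[sgn(X) sgn(Y)] for rho-correlated standard Gaussians."""
    return 2.0 / math.pi * math.asin(rho)


# The gap 1 - S_{1-delta}(sgn) should scale like sqrt(delta).
for delta in (1e-1, 1e-2, 1e-3, 1e-4):
    gap = 1.0 - sgn_stability(1.0 - delta)
    print(delta, gap, gap / math.sqrt(delta))
```

The printed ratio stays close to a constant near 0.9, consistent with a gap of order $\sqrt{\delta}$.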


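Case 2: $\rho_0 > -1 + 3\gamma$. In this case we follow the arguments from [116]'s proof of the Majority Is Stablest theorem. Write $\rho = -\rho_0 < 1 - 3\gamma$, and express $\rho = \rho' \cdot (1-\gamma)^2$. We let $g \in L^2(\mathbb{R}^n)$ be the multilinear polynomial

\[
g(x_1,\dots,x_n) = \sum_{S\subseteq[n]} (1-\gamma)^{|S|}\,\hat f(S) \prod_{i\in S} x_i,
\]

and we let $\tilde g : \mathbb{R}^n \to [-1,1]$ be the function defined by

\[
\tilde g(x) = \begin{cases} g(x) & \text{if } |g(x)| \le 1,\\ \mathrm{sgn}(g(x)) & \text{else.} \end{cases}
\]

We note that $f$ being odd implies that both $g$ and $\tilde g$ are odd. Since

\[
\mathbf{E}[f^2] = \sum_{S\subseteq[n]} \hat f(S)^2 \;\ge\; \sum_{S\in\mathbb{N}^n} \hat g(S)^2 = \mathbf{E}[g^2] \;\ge\; \mathbf{E}[\tilde g^2],
\]

we have

\[
(4.26) \;\ge\; (1-p)\,\mathbf{E}[\tilde g^2] - p\,\mathbb{S}_\rho(f).
\]

Further, using the fact that $f$ is $(\tau, \Omega(1/\log(1/\tau)))$-quasirandom, the Invariance Principle-based arguments in [116] imply that

\[
\bigl|\mathbb{S}_\rho(f) - \mathbb{S}_{\rho'}(\tilde g)\bigr| \le \tau^{\Omega(\gamma)}.
\]

Hence we have

\[
(4.26) \;\ge\; (1-p)\,\mathbf{E}[\tilde g^2] - p\,\mathbb{S}_{\rho'}(\tilde g) - \tau^{\Omega(\gamma)} \;=\; (1-p)\,\mathbf{E}[\tilde g^2] + p\,\mathbb{S}_{-\rho'}(\tilde g) - \tau^{\Omega(\gamma)} \;=\; 1 - 2\,\mathrm{val}_{\mathcal{G}_{P'}^{(n)}}(\tilde g) - \tau^{\Omega(\gamma)},
\]

where the first equality uses the fact that $\tilde g$ is odd and where $P'$ is the probability distribution that puts weight $1-p$ on $1$ and weight $p$ on $-\rho'$. But $P'$ is a ‘$(1,\rho_0)$-distribution’, and hence Theorem 4.3.4 implies that

\[
\mathrm{val}_{\mathcal{G}_{P'}^{(n)}}(\tilde g) \;\le\; \sup_{\substack{r:\mathbb{R}\to[-1,1]\\ \text{increasing, odd}}} \mathrm{val}_{\mathcal{G}_{P'}}(r).
\]

Thus we have

\[
(4.26) \;\ge\; \inf_{\substack{r:\mathbb{R}\to[-1,1]\\ \text{increasing, odd}}} \mathop{\mathbf{E}}_{\rho\sim P'}\bigl[\mathbb{S}_\rho(r)\bigr] - \tau^{\Omega(\gamma)}.
\]

By taking the constant in the definition of $\gamma$ large enough we get $\tau^{\Omega(\gamma)} \ll O(\log(1/\tau)^{-1/8})$. Thus to complete the proof of (4.25), we only need to relate the inf with $P$ to the inf with $P'$, using the fact that $|(-\rho') - \rho_0| \le O(\gamma)$. This can be done by using the discretization Lemmas 4.4.6 and 4.4.13; the resulting error term is at most $O(\gamma^{1/7}) \le O(\log(1/\tau)^{-1/8})$, as required.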
4.9  $\mathrm{Gap}_{\mathrm{Test}}(c) \ge S(c)$: RPR$^2$ Algorithms Imply Testing Lower Bounds

In this section we discuss ‘lower bounds’ for the dictator-vs.-quasirandom testing problem; i.e., proofs that any test $T$ with $\mathrm{Completeness}(T) = c$ cannot have $\mathrm{Soundness}(T)$ which is too small. As mentioned earlier, Khot and Vishnoi’s result can be used to get such lower bounds: it gives a long translation of a $(c,s)$ dictator-vs.-quasirandom test gap into a $(c-\eta, s+\eta)$ SDP gap (with triangle inequality, even), for arbitrarily small $\eta$. This means that an SDP-rounding guarantee can be used to rule out the existence of strong dictator-vs.-quasirandom tests. A similar idea arises from the earlier Theorem 4.1.11, which shows that a $(c,s)$ dictator-vs.-quasirandom test gap can be translated into a $c-\eta$ vs. $s+\eta$ UGC-hardness result for Max Cut. Since one feels it is unlikely that the UGC would be disproved via an elaborate reduction to Max Cut followed by a too-strong SDP-rounding algorithm, Theorem 4.1.11 also suggests that SDP-rounding algorithms should be able to prove dictator-vs.-quasirandom testing lower bounds.

In this section we show explicitly and directly that RPR$^2$ algorithms give rise to dictator-vs.-quasirandom testing lower bounds. More specifically, the following theorem implies (and indeed is slightly stronger than) the result $\mathrm{Gap}_{\mathrm{Test}}(c) \ge S(c)$:

Theorem 4.9.1. Let $\epsilon > 0$ be given. Then for all $n \ge O(1/\epsilon^7)$, if $T$ is any dictator-vs.-quasirandom test for functions $f : \{-1,1\}^n \to [-1,1]$ satisfying $\mathrm{Completeness}(T) \ge c$, then $\mathrm{Soundness}_{\epsilon,0}(T) \ge S(c) - \epsilon$.

Proof. Let $T$ be such a test. As described in section 4.7, $T$ can be thought of as an embedded graph on the vertex set $\{-1,1\}^n \subseteq S^{n-1}$. Write $P$ for the $\rho$-distribution of $T$, and recall from Proposition 4.7.5 that $\mathrm{Spread}(P) \ge \mathrm{Completeness}(T) \ge c$.

Imagine we now run our RPR$^2$ algorithm (Algorithm 4.2.4) on $T$, with the discretization parameter set to $\epsilon' := \epsilon/K$. By Theorem 4.4.3, it will at some point hit upon an $\epsilon'$-discretized, increasing, odd rounding function $r^* : \mathbb{R} \to [-1,1]$ which satisfies

\[
\mathrm{Alg}_{\mathrm{RPR}^2}(T) = \mathrm{val}_{\mathcal{G}_P}(r^*) \;\ge\; S(\mathrm{Spread}(P)) - O(\epsilon') \;\ge\; S(c) - \epsilon/2, \tag{4.27}
\]

assuming $K$ is a sufficiently large constant. (Here we also used that $S$ is increasing.) Recall that when we run the RPR$^2$ algorithm with $r^*$, it chooses a random $n$-dimensional Gaussian $Z$ and outputs the fractional cut $f_Z : \{-1,1\}^n \to [-1,1]$ defined by

\[
f_Z(x) = r^*(x \cdot Z).
\]

Thus (4.27) is equivalent to

\[
\mathop{\mathbf{E}}_{Z}\bigl[\mathrm{val}_T(f_Z)\bigr] \ge S(c) - \epsilon/2.
\]

Our goal is now to show the intuitively plausible claim that $f_Z$ is very likely to be a quasirandom Boolean function:

Claim 4.9.2. With probability at least $1 - O(1/n)$ over the choice of $Z$, the function $f_Z$ is $\bigl(O(\sqrt{\ln n/n})/\epsilon'^2,\ 0\bigr)$-quasirandom.
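With our choice of $n \ge O(1/\epsilon^7)$, this claim implies that with probability at least $1 - \epsilon/2$ the function $f_Z$ is $(\epsilon,0)$-quasirandom. This in turn completes the proof of the theorem, since it implies (using that $\mathrm{val}_T \le 1$ and that the conditioning event fails with probability at most $\epsilon/2$)

\[
\mathop{\mathbf{E}}_{Z}\bigl[\mathrm{val}_T(f_Z) \;\big|\; f_Z \text{ is } (\epsilon,0)\text{-quasirandom}\bigr] \;\ge\; S(c) - \epsilon/2 - \epsilon/2.
\]

Thus there must exist an $(\epsilon,0)$-quasirandom $f : \{-1,1\}^n \to [-1,1]$ with $\mathrm{val}_T(f) \ge S(c) - \epsilon$, and we conclude that $\mathrm{Soundness}_{\epsilon,0}(T) \ge S(c) - \epsilon$ as needed.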


Proof (of Claim 4.9.2). Given $Z$, let us write $f = f_Z$ for notational simplicity. Let us also write $\gamma = O(\sqrt{\ln n/n})/\epsilon'^2$. We need to show that with probability at least $1 - O(1/n)$,

\[
\gamma \;\ge\; \mathrm{Inf}_i^{(0)}(f) = \mathrm{Inf}_i(f) = \mathop{\mathbf{E}}_{x\in\{-1,1\}^n}\Bigl[\Bigl(\frac{f(x^{(i=1)}) - f(x^{(i=-1)})}{2}\Bigr)^2\Bigr] \quad \text{for all } 1 \le i \le n. \tag{4.28}
\]

Here we have used the notation $x^{(i=b)}$ for the string $x$ with the $i$th coordinate set to $b/\sqrt{n}$, along with the well-known alternate definition of Boolean influences (see [99]). In fact, we will show that (4.28) holds whenever both of the following hold:

\[
|Z_i| \le 2\sqrt{\ln n} \quad \text{for all } 1 \le i \le n; \tag{4.29}
\]

\n<\\Z\\l<\n.  (4.30) 

Since $|Z_i| \le 2\sqrt{\ln n}$ for each $i$ except with probability at most $O(1/n^2)$, a union bound over the $n$ coordinates shows that (4.29) holds except with probability $O(1/n)$. It’s also well known (and the proof is sketched in the proof of Theorem 4.3.3) that (4.30) holds except with exponentially small probability in $n$. Thus both (4.29) and (4.30) hold except with probability at most $O(1/n)$, as necessary.
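The tail estimate behind (4.29) is easy to check numerically. The following sketch (function names are ours) evaluates the exact two-sided Gaussian tail with `math.erfc` and compares it with the standard bound $\Pr[|Z| > t] \le 2e^{-t^2/2}$, which at $t = 2\sqrt{\ln n}$ equals $2/n^2$.

```python
import math


def tail_two_sided(t):
    """Pr[|Z| > t] for a standard Gaussian Z, via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))


def prob_some_coordinate_large(n):
    """Union bound on Pr[some |Z_i| > 2*sqrt(ln n)] over n independent coordinates."""
    return n * tail_two_sided(2.0 * math.sqrt(math.log(n)))


for n in (10, 100, 10**4, 10**6):
    print(n, tail_two_sided(2.0 * math.sqrt(math.log(n))), prob_some_coordinate_large(n))
```

The per-coordinate failure probability is at most $2/n^2$, so the union bound gives at most $2/n$.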


Let us henceforth fix $Z$ satisfying (4.29) and (4.30). We wish to prove now that (4.28) holds. We will show that it holds for $i = n$, and the fact that it holds for $1 \le i < n$ will follow by an identical argument. So we must prove that

\[
\gamma \;\ge\; \mathop{\mathbf{E}}_{x\in\{-1,1\}^n}\Bigl[\Bigl(\frac{f(x^{(n=1)}) - f(x^{(n=-1)})}{2}\Bigr)^2\Bigr] = \frac14 \mathop{\mathbf{E}}_{x\in\{-1,1\}^{n-1}}\Bigl[\Bigl(r\Bigl(\sum_{i=1}^{n-1} Z_i x_i + \frac{Z_n}{\sqrt n}\Bigr) - r\Bigl(\sum_{i=1}^{n-1} Z_i x_i - \frac{Z_n}{\sqrt n}\Bigr)\Bigr)^2\Bigr].
\]

Using the fact that $r$ is $\epsilon'$-discretized, we can even show the following stronger result:

\[
\Pr_{x\in\{-1,1\}^{n-1}}\Bigl[\ \sum_{i=1}^{n-1} Z_i x_i \pm \frac{Z_n}{\sqrt n}\ \text{fall into different intervals of the } \epsilon'\text{-discretization}\ \Bigr] \le \gamma. \tag{4.31}
\]

Let $\sigma^2$ denote $\sum_{i=1}^{n-1} Z_i^2/n$, which by (4.29) and (4.30) is bounded above and below by positive absolute constants. Now the random variable $\sum_{i=1}^{n-1} Z_i x_i$ has distribution close to that of a mean-zero Gaussian with variance $\sigma^2$; more specifically, using the Berry-Esseen Theorem we have that for every interval $I$,

\[
\Bigl|\ \Pr\Bigl[\sum_{i=1}^{n-1} Z_i x_i \in I\Bigr] - \Pr\bigl[N(0,\sigma^2) \in I\bigr]\ \Bigr| \;\le\; O\Bigl(\frac{\max_i |Z_i|}{\sigma\sqrt{n}}\Bigr) = O\bigl(\sqrt{\log n/n}\bigr). \tag{4.32}
\]

The analysis is now very similar to the analysis in Claim 4.4.7. Given any interval $J$ of the $\epsilon'$-discretization, let $J'$ denote the subinterval gotten by moving the boundary points inwards by $3\sqrt{\ln n/n}$. The analysis from Claim 4.4.7 implies that a standard Gaussian will fall into one of the $J'$ intervals except with probability $O(\sqrt{\ln n/n}/\epsilon'^2)$, and only the constant in the $O(\cdot)$ changes if we consider instead a Gaussian with variance $\sigma^2$ in the constant range above. Hence the same is true of the random variable $\sum_{i=1}^{n-1} Z_i x_i$, using (4.32). But whenever this random variable falls into some $J'$, we get that $\sum_{i=1}^{n-1} Z_i x_i \pm Z_n/\sqrt{n}$ are both in the associated $J$, since $|Z_n| \le 2\sqrt{\ln n}$. Since we took $\gamma = O(\sqrt{\ln n/n})/\epsilon'^2$, we have that (4.31) indeed holds, as needed. □


(Theorem  4.9.1) 


□ 


4.10  Hardness  Results  for  RPR2  Algorithms 

In  this  section  we  revisit  the  constructions  of  Karloff  [86],  Alon  and  Sudakov  [3],  and 
Alon,  Sudakov,  and  Zwick  [4].  The  purpose  of  these  constructions  is  to  demonstrate  that 
the  analysis  of  the  Goemans-Williamson  approximation  guarantee  is  tight  (and  likewise 
for  the  Zwick  [142]  approximation  guarantee,  in  the  case  of  [4]).  For  now  we  discuss  [3,  86], 
returning  to  [4]  at  the  end  of  the  section. 

The works [3, 86] consider the graph $T$ on $\{-1,1\}^n$ in which a pair of vertices $(x,y)$ is connected if and only if the vertices’ inner product is exactly $1-2c$; here $c$ is any rational parameter in $(\tfrac12,1)$.$^9$ The authors show (for infinitely many $n$) that the identity map is an optimal SDP embedding, and hence $\mathrm{Opt}(T) = \mathrm{Sdp}(T) = c$. On the other hand, since every edge in the embedded graph connects vectors with inner product $1-2c$, the expected value of the cut output by the GW algorithm (RPR$^2$ with the rounding function sgn) is only $\arccos(1-2c)/\pi$. Thus (in expectation, at least) the GW approximation curve satisfies $\mathrm{Apx}_{\mathrm{GW}}(c) \le \arccos(1-2c)/\pi$.
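To make the curve $\arccos(1-2c)/\pi$ concrete, the following sketch tabulates it on a grid and locates the value of $c$ at which the ratio to $c$ is smallest. The grid search and function names are our own illustrative choices; the minimum recovers the familiar Goemans-Williamson constant of roughly 0.8786.

```python
import math


def gw_value(c):
    """Expected cut value of hyperplane rounding when every edge has inner product 1 - 2c."""
    return math.acos(1.0 - 2.0 * c) / math.pi


def worst_ratio(steps=100000):
    """Grid search over c in (1/2, 1) for the smallest ratio gw_value(c) / c."""
    best = (float("inf"), None)
    for k in range(1, steps):
        c = 0.5 + 0.5 * k / steps
        best = min(best, (gw_value(c) / c, c))
    return best


ratio, c_star = worst_ratio()
print(ratio, c_star)
```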

As the reader can clearly see, this construction can be viewed as a dictator-vs.-quasirandom test with completeness $c$. Indeed, the noise sensitivity test of [99] is almost identical to it; the only difference is that the noise sensitivity test picks edges with expected inner product $1-2c$ rather than precise inner product $1-2c$. The ‘soundness’ result used in [3, 86] is that the average value among ‘random halfspace functions’ $\mathrm{sgn}(x \cdot Z)$ is at most $\arccos(1-2c)/\pi$. As we saw in section 4.9, these random halfspace functions are almost surely quasirandom.


$^9$The earlier work of [86] was slightly more complicated as it only included vertices with Hamming weight exactly $n/2$.
The result from [3, 86] has some additional strengths and weaknesses. One strength is that the SDP embedding used has all of its unit vectors on the discrete cube $\{-1,1\}^n$; hence these points satisfy the triangle inequalities, and indeed satisfy all ‘valid’ inequalities (see [86]). Thus $\mathrm{Apx}_{\mathrm{GW}}(c)$ is still at most $\arccos(1-2c)/\pi$ even if the SDP with triangle inequalities is used. A weakness of the original result was that it only stated that the expected value of the cut GW produces is at most $\arccos(1-2c)/\pi$; it said nothing, e.g., about what happens if the GW algorithm is run several times and the best resulting cut is selected. For the noise sensitivity version of the test, a result in [99] shows that GW achieves at most $\arccos(1-2c)/\pi + o(1)$ with high probability. However, Feige and Schechtman [50] showed an even better result:

Theorem 4.10.1 ([50]). For any rational $c \in (\tfrac12,1)$ and any $\eta > 0$, there are optimally embedded graphs $G$, with arbitrarily large numbers of vertices, satisfying:

•  $\mathrm{Opt}(G) = \mathrm{Sdp}(G) = c$;

•  the vectors in $G$ satisfy the triangle inequalities;

•  every halfspace cut has value at most $\arccos(1-2c)/\pi + \eta$.

The conclusion from this result is that running the RPR$^2$ algorithm $A$ with the rounding function sgn cannot achieve $\mathrm{Apx}_A(c) > \arccos(1-2c)/\pi$, even if: (i) $A$ uses the SDP with triangle inequalities; and, (ii) $A$ is not required to choose $Z$ at random but is allowed to use the best possible $Z$ of length $\sqrt{n}$. (When $r = \mathrm{sgn}$, the length of $Z$ is irrelevant and may as well be fixed.)

Feige and Schechtman prove Theorem 4.10.1 (non-constructively) as follows: They begin with the embedded graph $T$ on $\{-1,1\}^n$ constructed in [3, 86]. They then essentially take $G$ to consist of $m$ disjoint copies of $T$, each embedded in a random $n$-dimensional subspace of $\mathbb{R}^d$. If $d \gg n^2 \log m$, then the triangle inequalities hold in $G$ with high probability; on the other hand, if $d$ is not too large then it can be shown that every halfspace cut of $G$ has value at most $\arccos(1-2c)/\pi + \eta$.

We  now  prove  a  generalization  of  Theorem  4.10.1.  We  would  like  to  emphasize  that 
our  proof  follows  Feige  and  Schechtman ’s  extremely  closely. 

Theorem 4.10.2. Suppose $(c,s)$ is a dictator-vs.-quasirandom gap, and $\eta > 0$. Fix any RPR$^2$ rounding function $r$ which is piecewise constant.$^{10}$ Then there are embedded graphs $G$ in $S^{d-1}$ with arbitrarily large numbers of vertices, satisfying:

•  $\mathrm{Opt}(G) \ge c$;

•  the vectors in $G$ satisfy the triangle inequalities;

•  every fractional cut $f_Z$ of the form $f_Z(u) = r(u \cdot Z)$ satisfies $\mathrm{val}_G(f_Z) \le s + \eta$, as long as $\|Z\|_2 = \Theta(\sqrt{d})$.

Proof. Select $\epsilon, \delta > 0$ and a family $(T^{(n)})$ of dictator-vs.-quasirandom tests, with $T^{(n)}$ operating on $\{-1,1\}^n$, such that $\mathrm{Completeness}(T^{(n)}) \ge c$ and $\mathrm{Soundness}_{\epsilon,\delta}(T^{(n)}) \le s + \eta/3$, for all sufficiently large $n$. We would also like to assume that each $T^{(n)}$ is regular, meaning that each $x \in \{-1,1\}^n$ participates in the test with the same probability. We can ensure this by symmetrizing each $T^{(n)}$ with respect to the $2^n n!$ symmetries of $\{-1,1\}^n$, as discussed in section 4.8. (Alternatively, the dictator-vs.-quasirandom tests we will actually use, constructed in section 4.8, are already regular.)

$^{10}$As all functions implemented on a discrete computer must be.

As in [50], we take $G$ to be $m$ equally weighted disjoint copies of $T^{(n)}$, embedded on the unit $d$-dimensional sphere $S^{d-1}$ with independent random orientations. Since $\mathrm{Completeness}(T^{(n)}) \ge c$, certainly $\mathrm{Opt}(G) = \mathrm{Opt}(T^{(n)}) \ge c$. Also, as shown in [50], if $d \gg n^2 \log m$ then the vectors in $G$ satisfy the triangle inequalities with high probability; this uses the fact that the vectors in $T^{(n)}$ satisfy the triangle inequalities. It remains to analyze $\mathrm{val}_G(f_Z)$ for all possible fractional cuts $f_Z(u) := r(u \cdot Z)$ where $\|Z\|_2 = \Theta(\sqrt{d})$. For concreteness, assume that this means $(1/c)\sqrt{d} \le \|Z\|_2 \le c\sqrt{d}$ for some $c > 0$.

Let us consider the piecewise constant function $r$. Choose a small enough $\gamma > 0$ so that the set

\[
\mathcal{B} = \bigcup\,\bigl\{[t - \gamma,\, t + \gamma] : t \text{ is a point of discontinuity for } r\bigr\}
\]

has total measure at most $\epsilon\eta/O(\sqrt{c})$. Following [50], we now take a $\gamma$-net $\mathcal{N}$ for the set $\{Z : (1/c)\sqrt{d} \le \|Z\|_2 \le c\sqrt{d}\}$; this can have cardinality $O(c\sqrt{d}/\gamma)^d$. We show that, with high probability over the orientations of $G$, both of the following hold for all $v \in \mathcal{N}$:

1.  $\mathrm{val}_G(f_v) \le s + 2\eta/3$;

2.  the fraction of vertices $u$ of $G$ for which $u \cdot v \in \mathcal{B}$ is at most $\eta/6$.

Having shown this, it follows that $\mathrm{val}_G(f_Z) \le s + \eta$ for all $(1/c)\sqrt{d} \le \|Z\|_2 \le c\sqrt{d}$. To see this for a given $Z$, take $v$ to be the closest net point. Then for every $u \in G$ we have $|u \cdot Z - u \cdot v| \le \|u\| \cdot \|Z - v\| \le \gamma$. It follows that $f_Z(u) = f_v(u)$ except possibly when $u \cdot v \in \mathcal{B}$. But this occurs only for at most an $\eta/6$ fraction of vertices in $G$, and hence at most an $\eta/6$ fraction of edge weight, by regularity. It follows that $|\mathrm{val}_G(f_Z) - \mathrm{val}_G(f_v)| \le 2\eta/6$, and hence $\mathrm{val}_G(f_Z) \le s + \eta$, as required.

It remains to prove that items (1) and (2) above indeed hold with high probability. Fix any $v \in \mathcal{N}$ and let $T_1, \dots, T_m$ denote the randomly oriented copies of $T^{(n)}$ making up $G$. In analyzing some $T_j$ vis-a-vis $v$, we imagine instead that the orientation of $T_j$ is fixed and $v$ is chosen randomly from the surface of the sphere of radius $\|v\|_2$. In this framework, let $Y$ denote the projection of the random $v$ onto the $n$-dimensional subspace containing $T_j$. Now the projection of a random vector from the surface of a sphere onto a lower-dimensional subspace yields a distribution which is close to Gaussian. In particular, since we are already assuming $d \gg n^2 \log m \ge O(n^2)$, the results in [39] imply that the variation distance between $Y$ and the $n$-dimensional Gaussian distribution with coordinate variances equal to $\|v\|_2/\sqrt{d} \in [1/c, c]$ is at most $O(n/d) = O(1/n)$. If $Y$ were truly drawn from that Gaussian distribution, then we would have the following (cf. the proof of Claim 4.9.2):

•  the expected fraction of vertices $u$ of $T_j$ for which $u \cdot Y \in \mathcal{B}$ is at most $O(\sqrt{c}\,|\mathcal{B}|)$;

•  $|Y_i| \le O(\sqrt{c \ln n})$ for all $1 \le i \le n$;

•  lcn<\\Y\\l<^n. 

Similar to the proof of Claim 4.9.2, the last two of these imply that $f_v$ is an $(\epsilon, 0)$-quasirandom cut for $T_j$, as long as $O(\sqrt{c \ln n/n}) \le \gamma$ and $O(\sqrt{c}\,|\mathcal{B}|) \le \epsilon$. The latter holds by design; the former holds so long as we take $n \ge \mathrm{poly}(c/\gamma)$. But when $f_v$ is an $(\epsilon, 0)$-quasirandom cut for $T_i$, we have $\mathrm{val}_{T_i}(f_v) \le s + \eta/3$. Note also that $O(\sqrt{c}\,|\mathcal{B}|) \le \eta/24$ by design.

Overall, we conclude that for each $i$ independently we have $\mathrm{val}_{T_i}(f_v) \le s + \eta/3$, except with probability at most $O(1/n)$ over the choice of orientations. If we ensure that $n \ge O(1/\eta)$, we conclude that the expected value of $\mathrm{val}_{T_i}(f_v)$ is at most $s + \eta/2$. Similarly, we can conclude that the expected fraction of vertices $u$ of $T_i$ for which $u \cdot v \in \mathcal{B}$ is at most $\eta/12$. Since $\mathrm{val}_G(f_v) = \mathrm{avg}_{i\in[m]}\,\mathrm{val}_{T_i}(f_v)$, a Chernoff bound implies that item (1) above holds except with probability at most $\exp(-\Omega(\eta^2 m))$. Similarly, item (2) above holds except with probability at most $\exp(-\Omega(\eta^2 m))$. If we take $m \gg d \log d$ then this probability will be much smaller than $O(c\sqrt{d}/\gamma)^{-d}$ (treating $c$, $\gamma$, and $\eta$ as constants), and so we get that both items (1) and (2) hold with high probability for all net points simultaneously, by a union bound.

As in [50], the overall constraints we have on $m$ and $d$ are that $n^2 \log m \ll d \ll m/\log m$, and this can clearly be realized. □
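As a sanity check that the two constraints on $m$ and $d$ are compatible, here is one admissible parameter setting. The specific choice $d = n^3$, $m = d^2$ is ours, for illustration only, and is not taken from [50].

```latex
% Hypothetical choice: d = n^3 and m = d^2 = n^6.
\[
  n^2 \log m \;=\; 6\,n^2 \log n \;\ll\; n^3 \;=\; d,
  \qquad
  \frac{m}{\log m} \;=\; \frac{n^6}{6 \log n} \;\gg\; n^3 \;=\; d .
\]
% The same choice also gives m = n^6 \gg 3 n^3 \log n = d \log d,
% which is the condition used in the union bound over the net.
```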

We end this section by discussing the issue of self-loops and the construction of Alon, Sudakov, and Zwick [4]. If we use Theorem 4.10.2 with the noise sensitivity $(1,\rho_0)$-mixture tests constructed in section 4.8, we get a hard instance for RPR$^2$, but one that might be considered slightly unsatisfactory: this is because the embedded graph $G$ constructed has self-loops. However one can’t simply dismiss embedded graphs with self-loops, because optimally embedded graphs can have self-loops. In fact, Alon, Sudakov, and Zwick’s construction is the following: for each $(1,\rho_0)$-mixture distribution, they construct a self-loopless graph for which the optimal SDP embedding is essentially the noise sensitivity $(1,\rho_0)$-mixture test. More precisely, it is the version in which vertices are connected if their inner product is exactly $\rho_0$ or $1$. The technique of [4] involves taking the $(1,\rho_0)$-mixture test and replacing the self-loops by cliques, similar to the self-loop removal technique discussed in Section 4.12.


4.11  GapSDP(c)  is  Continuous 

In this Section we prove Proposition 4.3.2. The fact that $\mathrm{Gap}_{\mathrm{SDP}}(c)$ is increasing on $[\tfrac12,1]$ is immediate from the definition (since if $c' > c$, the inf for $c'$ is over a subset of the inf for $c$). We mainly focus on the proof that $\mathrm{Gap}_{\mathrm{SDP}}(c)$ is continuous on $(\tfrac12,1)$; this requires only a simple trick: the use of the isolated edge. The proof of continuity at $1$ requires appealing to Goemans-Williamson, and the continuity at $\tfrac12$ is trivial. Finally, the proof that $\mathrm{Gap}_{\mathrm{SDP}}(c)$ is strictly increasing requires an isolated clique trick, plus an appeal to a result of Zwick [142].

Definition 4.11.1. Given a graph $G$ and a parameter $0 < \epsilon < 1$, we define the graph $G \cup \mathrm{edge}_\epsilon$ to be the graph in which $G$’s edge-weights are scaled by a factor of $1-\epsilon$, and then two new vertices are added, with an edge between them of weight $\epsilon$.

The following is easy to verify:

Proposition 4.11.2. $\mathrm{Sdp}(G \cup \mathrm{edge}_\epsilon) = (1-\epsilon)\,\mathrm{Sdp}(G) + \epsilon$ and $\mathrm{Opt}(G \cup \mathrm{edge}_\epsilon) = (1-\epsilon)\,\mathrm{Opt}(G) + \epsilon$.
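The Opt half of this proposition can be confirmed by brute force on a small instance. The sketch below is illustrative only: the graph representation (a dict from vertex pairs to weights summing to 1), the helper names, and the triangle example are our own assumptions.

```python
from itertools import product


def opt(n, edges):
    """Max-cut value of a weighted graph on vertices 0..n-1; edges maps (u, v) -> weight."""
    best = 0.0
    for side in product((0, 1), repeat=n):
        cut = sum(w for (u, v), w in edges.items() if side[u] != side[v])
        best = max(best, cut)
    return best


def add_isolated_edge(n, edges, eps):
    """Scale all weights by (1 - eps) and add a new isolated edge of weight eps."""
    scaled = {e: (1.0 - eps) * w for e, w in edges.items()}
    scaled[(n, n + 1)] = eps
    return n + 2, scaled


# Triangle with total edge weight 1; its max cut has value 2/3.
triangle = {(0, 1): 1 / 3, (1, 2): 1 / 3, (0, 2): 1 / 3}
n2, edges2 = add_isolated_edge(3, triangle, 0.25)
print(opt(3, triangle), opt(n2, edges2))
```

The isolated edge can always be cut, so it contributes its full weight $\epsilon$ independently of how $G$ is cut.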
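We now prove:

Proposition 4.11.3. $\mathrm{Gap}_{\mathrm{SDP}}(c)$ is continuous on $(\tfrac12,1)$.

Proof. We first prove right-continuity on $(\tfrac12,1)$. Suppose $c \in (\tfrac12,1)$, and let $s = \mathrm{Gap}_{\mathrm{SDP}}(c)$. Given any sufficiently small $\epsilon > 0$, assume $c < c' < c + (1-c)\epsilon/2 < 1$. By the definition of $\mathrm{Gap}_{\mathrm{SDP}}(c) = s$ we can find some graph $G$ with $\mathrm{Sdp}(G) \ge c$ and $\mathrm{Opt}(G) \le s + \epsilon/2$. Let $\tilde G = G \cup \mathrm{edge}_{\epsilon/2}$. Then we have $\mathrm{Sdp}(\tilde G) \ge (1-\epsilon/2)c + \epsilon/2 = c + (1-c)\epsilon/2 > c'$, and further, $\mathrm{Opt}(\tilde G) \le (1-\epsilon/2)(s+\epsilon/2) + \epsilon/2 \le s + \epsilon$. This proves $\mathrm{Gap}_{\mathrm{SDP}}(c') \le s + \epsilon$. Since $\mathrm{Gap}_{\mathrm{SDP}}$ is increasing, we have proven right-continuity at $c$.

The proof of left-continuity on $(\tfrac12,1)$ is similar. Suppose $c \in (\tfrac12,1)$, and let $s = \mathrm{Gap}_{\mathrm{SDP}}(c)$. Given any sufficiently small $\epsilon > 0$, assume $\tfrac12 < c - 2\epsilon(1-c) < c' < c$. For any graph $G$ with $\mathrm{Sdp}(G) \ge c'$, let $\tilde G = G \cup \mathrm{edge}_{2\epsilon}$. We have $\mathrm{Sdp}(\tilde G) \ge (1-2\epsilon)c' + 2\epsilon = c' + 2\epsilon(1-c') > c' + 2\epsilon(1-c) > c$ and also $\mathrm{Opt}(\tilde G) = (1-2\epsilon)\,\mathrm{Opt}(G) + 2\epsilon$. By the definition of $\mathrm{Gap}_{\mathrm{SDP}}(c) = s$, it holds that $\mathrm{Opt}(\tilde G) \ge s$. Hence $(1-2\epsilon)\,\mathrm{Opt}(G) + 2\epsilon \ge s$, which implies $\mathrm{Opt}(G) \ge s - (1-\mathrm{Opt}(G))2\epsilon \ge s - \epsilon$. This proves $\mathrm{Gap}_{\mathrm{SDP}}(c') \ge s - \epsilon$. Since $\mathrm{Gap}_{\mathrm{SDP}}$ is increasing, we have proven left-continuity at $c$. □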

We next check continuity at the endpoints, $c = \tfrac12, 1$. It’s easy to see that if $\mathrm{Sdp}(G) = 1$ then $G$ must be bipartite and so $\mathrm{Opt}(G) = 1$. Hence $\mathrm{Gap}_{\mathrm{SDP}}(1) = 1$. Next, by taking the sequence of complete graphs $K_m$ (each with total edge-weight 1), which satisfy $\mathrm{Opt}(K_m) \le \tfrac12 + \tfrac1m \to \tfrac12$ as $m \to \infty$, we see that $\mathrm{Gap}_{\mathrm{SDP}}(\tfrac12) = \tfrac12$. Thus to check continuity at the endpoints we need to show that $\lim_{c \to (1/2)^+} \mathrm{Gap}_{\mathrm{SDP}}(c) = \tfrac12$ and $\lim_{c \to 1^-} \mathrm{Gap}_{\mathrm{SDP}}(c) = 1$.

The first of these follows simply because $\mathrm{Gap}_{\mathrm{SDP}}(c)$ is sandwiched between $\tfrac12$ and $c$ for all $c$. For the second of these, suppose $G$ is any graph with $\mathrm{Sdp}(G) \ge 1 - \epsilon$. The analysis of Goemans and Williamson [59] implies that one can find a cut in $G$ with value at least $1 - O(\sqrt{\epsilon})$. Thus $\mathrm{Gap}_{\mathrm{SDP}}(1-\epsilon) \ge 1 - O(\sqrt{\epsilon})$, and so $\lim_{c \to 1^-} \mathrm{Gap}_{\mathrm{SDP}}(c) = 1$ as claimed.

Finally, we check that $\mathrm{Gap}_{\mathrm{SDP}}(c)$ is strictly increasing. For this we introduce isolated cliques:

Definition 4.11.4. Given a graph $G$ and two parameters $m \in \mathbb{N}$ and $0 < \epsilon < 1$, we define the graph $G \cup K_{m,\epsilon}$ to be the graph in which $G$’s edge-weights are scaled by a factor of $1-\epsilon$, and then an isolated $m$-clique is added, whose total edge-weight is $\epsilon$.

Using the fact that $\mathrm{Opt}(K_{m,1}) \le \tfrac12 + \tfrac1m$, one can check:

Proposition 4.11.5. $\mathrm{Sdp}(G \cup K_{m,\epsilon}) \ge (1-\epsilon)\,\mathrm{Sdp}(G) + \epsilon/2$ and $\mathrm{Opt}(G \cup K_{m,\epsilon}) \le (1-\epsilon)\,\mathrm{Opt}(G) + (\tfrac12 + \tfrac1m)\epsilon$.
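The isolated clique trick relies on the max cut of a uniformly weighted clique approaching $\tfrac12$ as the clique grows. The sketch below checks this by brute force for small $m$; the function name is ours, and the bound $\tfrac12 + \tfrac1m$ tested here is one valid upper bound that we assume for illustration, not necessarily the exact form used in the text.

```python
from itertools import combinations, product


def opt_clique(m):
    """Max-cut value of K_m when its total edge weight is normalised to 1."""
    pairs = list(combinations(range(m), 2))
    best = 0
    for side in product((0, 1), repeat=m):
        best = max(best, sum(1 for u, v in pairs if side[u] != side[v]))
    return best / len(pairs)


for m in range(2, 11):
    # Exact value is floor(m/2) * ceil(m/2) / C(m, 2), attained by a balanced split.
    print(m, opt_clique(m), 0.5 + 1.0 / m)
```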

We  now  have: 

Proposition 4.11.6. $\mathrm{Gap}_{\mathrm{SDP}}(c)$ is strictly increasing on $[\tfrac12,1]$.

Proof. It’s enough to check this on $(\tfrac12,1)$. So suppose $\tfrac12 < c < c' < 1$, and write $s' = \mathrm{Gap}_{\mathrm{SDP}}(c')$. Zwick [142] was the first to show that $c' > \tfrac12$ implies $s' > \tfrac12$; Charikar and Wirth [30] specifically proved that $\mathrm{Sdp}(G) \ge \tfrac12 + \gamma$ implies $\mathrm{Opt}(G) \ge \tfrac12 + \Omega(\gamma/\log(1/\gamma))$. Thus we have $s' > \tfrac12$. Write $\epsilon = (c'-c)/c'$. Select $m$ large enough that $s' - \tfrac12 - \tfrac1m$ is still strictly positive. Finally, take $\delta > 0$ so that $\delta < (s' - \tfrac12 - \tfrac1m)\epsilon$.

By definition of $\mathrm{Gap}_{\mathrm{SDP}}(c') = s'$, we can find a graph $G'$ with $\mathrm{Sdp}(G') \ge c'$ and $\mathrm{Opt}(G') \le s' + \delta$. Let $G = G' \cup K_{m,\epsilon}$. Then $\mathrm{Sdp}(G) \ge (1-\epsilon)c' + \epsilon/2 > (1-\epsilon)c' = c$. Further, $\mathrm{Opt}(G) \le (1-\epsilon)(s'+\delta) + (\tfrac12 + \tfrac1m)\epsilon < s' + (\tfrac12 + \tfrac1m - s')\epsilon + \delta < s' - \delta + \delta = s'$. We conclude that $\mathrm{Gap}_{\mathrm{SDP}}(c) < s' = \mathrm{Gap}_{\mathrm{SDP}}(c')$. Thus $\mathrm{Gap}_{\mathrm{SDP}}(c)$ is indeed strictly increasing. □


4.12  SDP  Gaps  Based  on  Infinite,  Self-looped  Graphs 

In  this  Section  we  prove  Proposition  4.3.1. 


Proof. Write $G_0 = G$. We will transform $G_0$ into $G_1$, an infinite graph on vertex set $B_d$; then $G_1$ into $G_2$, a finite graph (with self-loops); then $G_2$ into $G_3$, a self-loopless graph; then $G_3$ into $G_4$, an unweighted graph. The desired graph will then be $G' = G_4$. The first transformation uses the idea of embedded graphs, and the remaining transformations are all previously known.

Let $g : \mathbb{R}^d \to B_d$ be a map achieving the sup in the definition of $\mathrm{Sdp}(G_0)$ to within $\epsilon$. Let $G_1$ be the infinite graph on $B_d$ given by pushing forward $G_0$ via $g$, i.e., $G_1(A,B) = G_0(g^{-1}(A), g^{-1}(B))$ (here we’re identifying a graph with the probability measure defining its ‘edge weights’). We immediately get $\mathbf{E}_{(x,y)\sim G_1}[\tfrac12 - \tfrac12\, x \cdot y] \ge c - \epsilon$. We can think of this as saying:

‘$\mathrm{Sdp}(G_1)$’ $\ge c - \epsilon$,  (4.33)

with the identity mapping as the embedding. Further,

$\mathrm{Opt}(G_1) \le s$,  (4.34)

because for any fractional cut $h : B_d \to [-1,1]$ for $G_1$, the cut $h \circ g : \mathbb{R}^d \to [-1,1]$ for $G_0$ achieves the same value, $\mathbf{E}_{(x,y)\sim G_1}[\tfrac12 - \tfrac12\, h(x)h(y)] = \mathbf{E}_{(x,y)\sim G_0}[\tfrac12 - \tfrac12\, (h \circ g(x))(h \circ g(y))]$.

We next discretize $G_1$ in the manner of, say, Feige and Schechtman [50]. Choose an $\epsilon$-net $\mathcal{N}$ within $B_d$ of size at most $O(1/\epsilon)^d$. Further, partition $B_d$ into Voronoi cells based on $\mathcal{N}$, with a disjoint cell $C_v$ for each $v \in \mathcal{N}$. Now define the (finite) graph $G_2$ on $\mathcal{N}$ by taking $G_2(u,v) = G_1(C_u, C_v)$ (again, we identify a graph with its edge distribution). We claim

$\mathrm{Sdp}(G_2) \ge c - 3\epsilon$.  (4.35)

To see this, recall that the identity embedding for $G_1$ achieves $\mathbf{E}_{(x,y)\sim G_1}[\tfrac12 - \tfrac12\, x \cdot y] \ge c - \epsilon$. Now if $x$ is in the cell $C_u$ and $y$ is in the cell $C_v$, then $x \cdot y = (u + \eta_1) \cdot (v + \eta_2)$ for some vectors $\eta_1, \eta_2$ of length at most $\epsilon$; this implies $|x \cdot y - u \cdot v| \le 3\epsilon$. Since we can draw from $G_2$ by drawing $(x,y) \sim G_1$ and then taking $(u,v)$ such that $x \in C_u$ and $y \in C_v$, we conclude that $\mathbf{E}_{(u,v)\sim G_2}[\tfrac12 - \tfrac12\, u \cdot v] \ge c - \epsilon - \tfrac32\epsilon$. We conclude that (4.35) holds with the identity map as the embedding. The fact that

$\mathrm{Opt}(G_2) \le s$  (4.36)

follows for the same reason as (4.34): any cut for $G_2$ can be extended to an equally good cut for $G_1$.
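The inner-product perturbation bound used for (4.35) follows from $x \cdot y - u \cdot v = u \cdot \eta_2 + \eta_1 \cdot v + \eta_1 \cdot \eta_2$, which is at most $2\epsilon + \epsilon^2 \le 3\epsilon$ in absolute value when $\|u\|, \|v\| \le 1$. The randomized check below is only a sketch; the sampling scheme, dimension, and helper names are our own choices.

```python
import math
import random


def random_in_ball(d, radius, rng):
    """A uniformly random point in the d-dimensional ball of the given radius."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in g))
    r = radius * rng.random() ** (1.0 / d)
    return [r * t / norm for t in g]


def dot(a, b):
    return sum(s * t for s, t in zip(a, b))


def max_deviation(d, eps, trials, seed=0):
    """Largest observed |x.y - u.v| where x = u + eta1, y = v + eta2, |eta| <= eps."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = random_in_ball(d, 1.0, rng)
        v = random_in_ball(d, 1.0, rng)
        x = [a + b for a, b in zip(u, random_in_ball(d, eps, rng))]
        y = [a + b for a, b in zip(v, random_in_ball(d, eps, rng))]
        worst = max(worst, abs(dot(x, y) - dot(u, v)))
    return worst


print(max_deviation(5, 0.1, 2000))
```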
We now eliminate self-loops from $G_2$, forming $G_3$, using the construction of Khot and O’Donnell [102], which itself is based on a trick of Arora, Berger, Hazan, Kindler, and Safra [8]. It is shown therein that for any $\epsilon > 0$, we can take $G_3$ to have $O(1/\epsilon)^2$ times as many vertices as $G_2$, and satisfy

$\mathrm{Sdp}(G_3) \ge \mathrm{Sdp}(G_2) \ge c - 3\epsilon$,  (4.37)

and

$\mathrm{Opt}(G_3) \le \mathrm{Opt}(G_2) \le s + \epsilon$.  (4.38)

Finally, we form $G' = G_4$ from $G_3$, converting weighted edges to unweighted edges. There is a simple randomized way to do this (see, e.g., [19, 35]), taking a weighted graph on $m$ vertices into an unweighted one on $\mathrm{poly}(m/\epsilon)$ vertices, such that

$\mathrm{Sdp}(G_4) \ge \mathrm{Sdp}(G_3) - \epsilon \ge c - 4\epsilon$,  (4.39)

and

$\mathrm{Opt}(G_4) \le \mathrm{Opt}(G_3) + \epsilon \le s + 2\epsilon$.  (4.40)

Since $G_3$ has $O(1/\epsilon)^{d+2}$ vertices, our $G_4$ has $n = (1/\epsilon)^{O(d)}$ vertices, as claimed. The proof follows after replacing $\epsilon$ by $\epsilon/4$. □

4.13  RPR2  —  Implementation  Issues 

In this section we mention a few implementation issues that arise in the use of the RPR$^2$ framework and discuss how they affect our algorithmic guarantees. All of these issues have been considered before; see [46, 48, 50, 59, 112].

Exact Solving of the SDP. The SDP-solving guarantee one actually has is that a solution within $\epsilon$ of optimum can be found in time $\mathrm{poly}(n) \cdot \log(1/\epsilon)$. We have already treated this issue in the proof of Corollary 4.4.4. Another related issue is that the vectors returned by the SDP-solver may not lie precisely on the unit sphere, something we assumed in our analysis. This can be taken care of by shrinking all vectors slightly so that they lie within the unit ball, and then adding a fictitious extra coordinate with tiny values to make the vectors have length exactly 1.
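The renormalisation step just described can be sketched as follows. This is a minimal illustration under our own assumptions: the function name, the uniform shrink factor, and the `shrink` parameter are hypothetical and are not taken from the cited works.

```python
import math


def to_unit_sphere(vectors, shrink=1e-6):
    """Shrink all vectors uniformly into the unit ball, then append one extra
    coordinate to each so that every resulting vector has length exactly 1."""
    norms = [math.sqrt(sum(t * t for t in v)) for v in vectors]
    factor = (1.0 - shrink) / max(1.0, max(norms))
    out = []
    for v in vectors:
        w = [factor * t for t in v]
        slack = 1.0 - sum(t * t for t in w)  # positive, since |w| <= 1 - shrink
        out.append(w + [math.sqrt(slack)])
    return out


# Example: nearly-unit vectors as an approximate SDP solver might return them.
approx = [[1.0000003, 0.0], [0.6, 0.7999999], [0.0, -0.9999998]]
for w in to_unit_sphere(approx):
    print(w, math.sqrt(sum(t * t for t in w)))
```

When the inputs are already close to unit length, the appended coordinates are tiny, so the pairwise inner products (and hence the SDP objective) change only slightly.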

Choosing Gaussian Random Variables. Again, this cannot be done precisely, but the approximation methods of Mahajan and Ramesh [112] show that one can incur only an $\epsilon$ loss at the expense of only $\mathrm{poly}(n, 1/\epsilon)$ time.

Expectation vs. High probability vs. Deterministic. Our results have been concerned
with showing that the expected value of the fractional cut produced by the (randomized)
RPR2 algorithm is at least $S(c)$. One can turn this into a high-probability result, losing
only an additive $\epsilon$ in cut value, by using $\mathrm{poly}(n, 1/\epsilon)$ independent repetitions. Alternatively,
one can derandomize the RPR2 framework, again losing only an additive $\epsilon$ in the cut value,
via the method of conditional expectations; this can be done in $\mathrm{poly}(n, 1/\epsilon)$ time [112] or
$O(n) \cdot 2^{\mathrm{poly}(1/\epsilon)}$ time [46]. Having done either of these, one has a fractional cut with value
at least $S(c) - \epsilon$. This can be converted into a proper cut with at least the same value by
the method of conditional expectations.
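The last conversion step can be illustrated with a short sketch. It assumes a weighted graph without self-loops given as a list of `(i, j, w)` triples and a fractional cut `x` in $[-1, 1]^n$ whose value is $\sum w_{ij} (\tfrac12 - \tfrac12 x_i x_j)$; the function names are hypothetical. Because the value is linear in each single coordinate, one of the two endpoints $\pm 1$ never decreases it.

```python
def cut_value(edges, x):
    """Value of a (possibly fractional) cut x: sum of w * (1/2 - x_i * x_j / 2)."""
    return sum(w * (0.5 - 0.5 * x[i] * x[j]) for i, j, w in edges)

def round_fractional_cut(edges, x):
    """Round coordinates one at a time to -1 or +1 without decreasing the value.

    With no self-loops the value is linear in each x_i, so the better of the
    two endpoints is at least as good as the current fractional setting."""
    x = list(x)
    for i in range(len(x)):
        best_val, best_sign = None, None
        for s in (-1.0, 1.0):
            x[i] = s
            val = cut_value(edges, x)
            if best_val is None or val > best_val:
                best_val, best_sign = val, s
        x[i] = best_sign
    return x
```

For example, `round_fractional_cut([(0, 1, 1.0), (1, 2, 1.0)], [0.2, -0.5, 0.9])` returns a vector of $\pm 1$ entries whose cut value is at least that of the fractional input.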

Multiple Rounding Functions. As discussed in section 4.1.3, we also want to try a
collection $\mathcal{R}$ of rounding functions. For a high-probability result, we can simply repeat
the algorithm $O(|\mathcal{R}| \log |\mathcal{R}|)$ times for each rounding function, and this will achieve what
the best of them does. Alternatively, we can just use the derandomized algorithm once
for each $r \in \mathcal{R}$.

Proper Cuts when G Has Self-loops. Given a graph $G$ with self-loops, we cannot
actually find proper cuts with value at least $S(\mathrm{Sdp}(G))$. For example, if $G$ consists of a
single self-loop then $\mathrm{Sdp}(G) = \tfrac12$ (via the embedding mapping the vertex to 0), but there
is no proper cut of value $\tfrac12$. The way to interpret our guarantee for graphs $G$ with
self-loops is as follows: first, remove the self-loops from $G$, forming $G'$; note that this does
not change the value of the optimal proper cut. Then our algorithm achieves at least
$S(\mathrm{Sdp}(G')) - \epsilon \ge S(O(G)) - \epsilon$, where $O(G)$ denotes the value of the optimal proper cut in $G$.


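4.14  Improved Asymptotics of $S(\tfrac12 + \epsilon)$

As described in section 4.1.8, Charikar and Wirth [30] established
$\mathrm{Gap}_{\mathrm{SDP}}(\tfrac12 + \epsilon) \ge \tfrac12 + \Omega(\epsilon/\ln(1/\epsilon))$ and Khot and O'Donnell [102] established
$\mathrm{Gap}_{\mathrm{SDP}}(\tfrac12 + \epsilon) \le \tfrac12 + O(\epsilon/\ln(1/\epsilon))$. In this section we carefully examine these proofs
and conclude the following:

Theorem 4.14.1.  $\mathrm{Gap}_{\mathrm{SDP}}(\tfrac12 + \epsilon) = S(\tfrac12 + \epsilon) = \tfrac12 + (\tfrac12 + o(1)) \cdot \epsilon/\ln(1/\epsilon)$.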

Proof. We upper-bound $S(c)$ essentially by repeating the argument in [102], paying more
attention to the constants. Take $P$ to be the $(1, \rho_0)$-distribution with weight $p = \tfrac23 + \tfrac43\epsilon$ on
$\rho_0 = -\tfrac12$ and weight $\tfrac13 - \tfrac43\epsilon$ on 1. Now if $r : \mathbb{R} \to [-1, 1]$ is any odd one-dimensional rounding
function, we have

$\mathrm{val}_P(r) = \tfrac12 - \tfrac12 \left[ (\tfrac13 - \tfrac43\epsilon) S_1(r) + (\tfrac23 + \tfrac43\epsilon) S_{-1/2}(r) \right]$
$\qquad = \tfrac12 - \tfrac12 \sum_{\text{odd } s} \left( \tfrac13 - \tfrac43\epsilon + (\tfrac23 + \tfrac43\epsilon)(-\tfrac12)^s \right) \hat r(s)^2$
$\qquad \le \tfrac12 + \epsilon\, \hat r(1)^2 - \sum_{\text{odd } s \ge 3} (\tfrac18 - \tfrac34\epsilon)\, \hat r(s)^2$
$\qquad = \tfrac12 + \epsilon\, \hat r(1)^2 - (\tfrac18 - \tfrac34\epsilon)\, \mathbf{E}[(r - Lr)^2],$  (4.41)

where $L$ denotes the 'projection to degree 1' operator; i.e., $Lr(x) = \hat r(1) x$. As in [102] we
consider the value of $\sigma^2 := \hat r(1)^2 = \mathbf{E}[(Lr)^2]$, the variance of the Gaussian $Lr(x)$. Using
$|r| \le 1$, we lower-bound

$\mathbf{E}[(r - Lr)^2] \ge \mathbf{E}[\mathbf{1}_{\{|Lr| \ge 1\}} \cdot (\mathrm{sgn}(Lr) - Lr)^2],$

which asymptotically is $\sigma^{\Theta(1)} \cdot \exp(-1/(2\sigma^2))$. If $\sigma \gg 1/\sqrt{2\ln(1/\epsilon)}$ then the final term in (4.41)
will exceed $\epsilon$, making the overall quantity less than $\tfrac12$. Thus in upper-bounding (4.41) we
can assume $\sigma \le (1 + o(1))/\sqrt{2\ln(1/\epsilon)}$, and thus we get an upper bound of
$\tfrac12 + (\tfrac12 + o(1))\, \epsilon/\ln(1/\epsilon)$, as claimed.

To lower-bound $\mathrm{Gap}_{\mathrm{SDP}}(\tfrac12 + \epsilon)$ we refer to [30, equation (11)], which shows that

$\mathrm{Gap}_{\mathrm{SDP}}(\tfrac12 + \epsilon) \ge \tfrac12 + \frac{\epsilon}{T^2} - 4e^{-T^2/2}$

for every $T > 1$. By taking $T = (1 + o(1)) \cdot \sqrt{2\ln(1/\epsilon)}$, we get a lower bound of
$\tfrac12 + (\tfrac12 - o(1)) \cdot \epsilon/\ln(1/\epsilon)$.  □


4.15  Approximate Values of S(c)

Table 4.1: Value of S(c)

  c       S(c)   |    c       S(c)   |    c       S(c)   |    c       S(c)
0.505    0.5008  |  0.590    0.5414  |  0.675    0.6012  |  0.760    0.6694
0.510    0.5021  |  0.595    0.5446  |  0.680    0.6050  |  0.765    0.6736
0.515    0.5036  |  0.600    0.5478  |  0.685    0.6089  |  0.770    0.6778
0.520    0.5053  |  0.605    0.5510  |  0.690    0.6127  |  0.775    0.6820
0.525    0.5072  |  0.610    0.5544  |  0.695    0.6167  |  0.780    0.6862
0.530    0.5092  |  0.615    0.5577  |  0.700    0.6206  |  0.785    0.6905
0.535    0.5113  |  0.620    0.5611  |  0.705    0.6245  |  0.790    0.6947
0.540    0.5136  |  0.625    0.5646  |  0.710    0.6285  |  0.795    0.6990
0.545    0.5160  |  0.630    0.5681  |  0.715    0.6325  |  0.800    0.7033
0.550    0.5185  |  0.635    0.5716  |  0.720    0.6365  |  0.805    0.7076
0.555    0.5211  |  0.640    0.5752  |  0.725    0.6406  |  0.810    0.7119
0.560    0.5238  |  0.645    0.5788  |  0.730    0.6446  |  0.815    0.7162
0.565    0.5265  |  0.650    0.5825  |  0.735    0.6487  |  0.820    0.7206
0.570    0.5294  |  0.655    0.5861  |  0.740    0.6528  |  0.825    0.7249
0.575    0.5323  |  0.660    0.5898  |  0.745    0.6569  |  0.830    0.7293
0.580    0.5352  |  0.665    0.5936  |  0.750    0.6611  |  0.835    0.7336
0.585    0.5383  |  0.670    0.5974  |  0.755    0.6652  |  0.840    0.7380

Chapter 5


Conditional Hardness of
Approximating Satisfiable 3-CSPs

5.1  Introduction


In this chapter, we study the approximability of 3-CSPs; in particular, we are interested in
instances of 3-CSP that are satisfiable (i.e., instances whose optimum value is 1). We
study the largest $s$ such that we can still $(1, s)$-approximate Max 3-CSP in polynomial
time. Besides being important in its own right, the problem is strongly motivated by
the study of Probabilistically Checkable Proofs.

5.1.1  The  PCP  Characterization  of  NP 

The famous PCP (Probabilistically Checkable Proof) Theorem states that any language in NP
has a proof system where the proofs can be probabilistically checked in a query-efficient
way. The notation $\mathrm{PCP}_{c,s}(r(n), q(n))$ stands for the class of languages $M$ verifiable by a
proof system with the following parameters: for an input $x$ of length $n$, the verifier uses
$r(n)$ random bits and queries $q(n)$ bits of the proof to decide in polynomial time whether
$x$ is in $M$ or not. The verifier has the following performance guarantees: i) if $x$ is in $M$,
there exists a proof that passes with probability at least $c$, and ii) if $x$ is not in $M$, no proof
passes with probability more than $s$. We call $c$ the completeness and $s$ the soundness of the
verifier.

If the verifier makes all its queries at one time based only on $x$ and the $r(n)$ random
bits, it is called nonadaptive. On the other hand, if the verifier picks the location of its next
query based on $x$, the random bits and the answers to all the previous queries, it is called
adaptive. The notations aPCP and naPCP are used to distinguish languages verifiable by
adaptive and nonadaptive verifiers. An adaptive verifier can achieve better parameters, while
a nonadaptive verifier has more natural implications for the hardness of approximation for
CSPs (see Theorem 5.1.3 for more discussion). We mainly focus on nonadaptive proof systems
in this work.

Formally,  the  PCP  Theorem  [9,  10]  states  that: 

Theorem 5.1.1.  $\mathrm{NP} \subseteq \mathrm{naPCP}_{1,1/2}(O(\log n), O(1))$.

We can see that $c$ is 1 in the PCP Theorem; i.e., when the input $x$ is in the language,
there exists a proof that passes with probability 1. Such a verifier is said to have perfect
completeness, which is a natural and desirable property of a proof system. Much effort
has been devoted to optimizing the tradeoff between $q(n)$ and $s$ (as well as some other
parameters such as proof length, adaptivity, and free bit complexity) [19, 66, 76, 130]. It is
known that in order to have $c = 1$ and $s$ bounded away from 1, the minimum number of
queries that the verifier needs to make is 3. The subject of study in this work is then to
optimize the soundness $s$ for 3-query nonadaptive PCP systems with perfect completeness.
Formally, we examine the following question:

Question 5.1.2.  What is the smallest $s$ such that $\mathrm{NP} \subseteq \mathrm{naPCP}_{1,s}(O(\log n), 3)$?

This problem was first studied in [19], where Bellare, Goldreich and Sudan showed
that $\mathrm{NP} \subseteq \mathrm{naPCP}_{1,0.8999+\epsilon}(O(\log n), 3)$. Håstad [76] further improved this result to
$\mathrm{NP} \subseteq \mathrm{naPCP}_{1,3/4+\epsilon}(O(\log n), 3)$. Around the same time, Zwick [141] showed that
$\mathrm{naPCP}_{1,5/8}(O(\log n), 3) \subseteq \mathrm{BPP}$ by giving a randomized polynomial-time 5/8-approximation
algorithm for satisfiable 3-CSPs. Therefore, unless $\mathrm{NP} \subseteq \mathrm{BPP}$, the best $s$ must be bigger
than 5/8. Zwick further conjectured that this algorithm is optimal, i.e.,
$\mathrm{NP} \subseteq \mathrm{naPCP}_{1,5/8+\epsilon}(O(\log n), 3)$ (see Section 5.1.2 for more discussion). After nearly a
decade, Khot and Saket [104] showed that $\mathrm{NP} \subseteq \mathrm{naPCP}_{1,20/27+\epsilon}(O(\log n), 3)$.

We note that certain relaxations of the problem are well understood. If we allow the
verifier to be adaptive, Guruswami et al. [66] proved that $\mathrm{NP} \subseteq \mathrm{aPCP}_{1,1/2+\epsilon}(O(\log n), 3)$. If
we allow an arbitrarily small loss of completeness for the nonadaptive verifier, Håstad [76]
showed that $\mathrm{NP} \subseteq \mathrm{naPCP}_{1-\epsilon,1/2+\epsilon}(O(\log n), 3)$. Both of the above results achieve optimal
soundness assuming $\mathrm{NP} \not\subseteq \mathrm{BPP}$ [141].

We think that Question 5.1.2 addresses an important missing piece in our understanding
of 3-query PCP systems. In addition, as mentioned above, answering this question is
equivalent to determining the optimal hardness of approximation ratio for satisfiable 3-CSPs.

5.1.2  Hardness  of  Approximation  and  Khot’s  Conjectures 

The relationship between PCPs and Max 3-CSP is due to the following well-known
connection between PCPs and hardness of approximation:

Theorem 5.1.3.  Let $\Phi$ be a set of predicates with arity no more than $k$. The following two
statements are equivalent: i) Max $\Phi$ $(c, s)$ is NP-hard. ii) For some NP-complete language
$L$, there is a PCP system with a nonadaptive verifier that uses predicates only from $\Phi$ and
has completeness $c$ and soundness $s$.¹

Note that nonadaptiveness is crucial in Theorem 5.1.3. If the verifier in the above theorem
is adaptive, the hardness result would hold only for CSPs with the predicate set “depth-$k$
decision trees with a predicate in $\Phi$ at each node”.

As  a  direct  application  of  the  theorem,  we  have  that  Question  5.1.2  is  equivalent  to  the 
following  question: 

Question 5.1.4.  What is the smallest $s$ such that Max 3-CSP $(1, s)$ is NP-hard?

We can also see why, unless $\mathrm{NP} \subseteq \mathrm{BPP}$, Zwick's randomized 5/8-approximation
algorithm for satisfiable Max 3-CSP [141] mentioned earlier implies that the smallest $s$ in
Questions 5.1.2 and 5.1.4 must be bigger than 5/8. Zwick's conjecture is that $s = 5/8 + \epsilon$
in both Questions 5.1.2 and 5.1.4.

5.1.3  Satisfiable  Max  NTW 

The optimal hardness of satisfiable 3-CSPs also corresponds to an open problem posed in the
conclusion of the seminal paper of Håstad [76]. The problem he asked is to decide, for
satisfiable Max-NTW, whether there exists a polynomial-time approximation algorithm that
beats the random assignment threshold 5/8. The following is the formal statement of Håstad's
open problem:

Question 5.1.5.  For any constant $\epsilon > 0$, given a satisfiable Max-NTW instance, is it
NP-hard to find an assignment that satisfies more than a $5/8 + \epsilon$ fraction of the constraints?

Question 5.1.5 is important because a “Yes” answer to it also resolves Questions 5.1.2 and
5.1.4, since Max-NTW is a special case of Max 3-CSP.

¹ Strictly speaking, the hardness result in Theorem 5.1.3 is only for weighted Max $k$-CSPs. As for
satisfiable Max $k$-CSPs, the inapproximability is the same for weighted and unweighted instances due to the
reduction in [35].

As a result of Theorem 5.1.3, Question 5.1.5 is equivalent to deciding whether there
is a nonadaptive PCP system for some NP-complete problem in which the verifier has
perfect completeness and soundness $5/8 + \epsilon$ and uses the same predicate set as Max-NTW.
Constructing such a PCP system for d-to-1 Label-Cover is the main focus of the
remainder of this chapter.


5.2  Our  Contribution  and  Methods 

5.2.1  Main  Results 

Our main result is that we can solve Håstad's open problem (Question 5.1.5) assuming
that Khot's d-to-1 Conjecture holds for any finite $d$. Formally, our main theorem can be
stated as follows:

Theorem 5.2.1.  Assuming Khot's d-to-1 Conjecture holds for any finite positive integer
$d$, Max NTW $(1, 5/8 + \epsilon)$ is NP-hard. Equivalently, there is a 3-query PCP system
for NP that has perfect completeness and soundness $5/8 + \epsilon$ for any $\epsilon > 0$. In addition, the
verifier is nonadaptive and uses the same predicates as Max NTW.

The above theorem answers Questions 5.1.2 and 5.1.4 and confirms Zwick's
conjecture.

Corollary 5.2.2.  Assuming Khot's d-to-1 Conjecture holds for any finite positive integer $d$,
for any $\epsilon > 0$, Max 3-CSP $(1, 5/8 + \epsilon)$ is NP-hard. Equivalently,
$\mathrm{NP} \subseteq \mathrm{naPCP}_{1,5/8+\epsilon}(O(\log n), 3)$. Further assuming that $\mathrm{NP} \not\subseteq \mathrm{BPP}$, Zwick's
5/8-approximation algorithm for satisfiable 3-CSPs is optimal and $5/8 + \epsilon$ is the optimal $s$
for both Questions 5.1.2 and 5.1.4.

5.2.2  Methods 

In the proof we build a PCP system for d-to-1 Label-Cover that reads three bits and
checks the NTW predicate on literals of these bits.

The main technicalities of this work are i) designing the verifier for d-to-1
Label-Cover; ii) analyzing the soundness of the proof system.

Our verifier can be viewed as a generalization of the 3-query dictator test in [121]. The
dictator test in [121] generates queries from the sample space $\{-1,1\}^n \times \{-1,1\}^n \times \{-1,1\}^n$
for testing “one function”. For the use of d-to-1 Label-Cover, roughly speaking, we need
the verifier to handle the query space $\{-1,1\}^n \times \{-1,1\}^{dn} \times \{-1,1\}^{dn}$ for testing “two functions”.

In the analysis of the PCP system, the main challenge (as usual) is to bound the
expectation of certain quadratic and cubic terms. The problem is more complicated than
in [121], and some very different techniques are used in this work. We analyze the
quadratic term in the resulting Fourier analysis using some novel arguments about
the positivity of certain linear operators. For the cubic term, we use Invariance Principle
style arguments similar to [115, 116]. However, none of the invariance theorems in
[115, 116] can be applied to our proof as a black box, since our distribution does not have
2-wise independence. In addition, unlike other proofs using the Invariance Principle,
which usually depend on some properties of Gaussian space, we prove some invariance
properties between two distributions over the Boolean cube. These two distributions have the
same (non-zero) 2-wise correlation, while the Invariance Principle helps to handle the
hard-to-analyze 3-wise correlation.

5.2.3  Related  and  Subsequent  Work 

Related Work  Most hardness reductions from Unique-Games to Max-$\Phi$ problems
involve designing a “dictator test” that only uses predicates from $\Phi$. In [121], we
proposed a 3-query dictator test based on the NTW predicate with soundness $5/8 + \epsilon$ and
completeness 1. However, the test can only be directly used to build a PCP system for
Unique Label Cover, and such a proof system therefore does not have perfect completeness.

Subsequent Work  After our work, there have been several results built on top of the
techniques developed here. In [136], the authors studied the problem of $k$-query
PCPs with perfect completeness. Their main contribution is a $k$-query dictator test with
perfect completeness, although it remains an open question how to compose their dictator
test with a proper outer verifier. In another recent work [137], the author investigated
3-query PCP systems over $\mathbb{Z}_q$ with perfect completeness and obtained improved soundness
assuming the d-to-1 conjecture.

5.2.4  PCP  System  Framework 

The high-level framework of our PCP system is similar to Håstad's construction for Max
3Lin [76]. We are given a d-to-1 Label-Cover instance $\mathcal{L} = (U, V, E, P, R_1, R_2, \Pi)$. A “proof”
for $\mathcal{L}$ consists of a collection of truth tables of Boolean functions, one for each vertex. More
specifically, for each vertex $u \in U$, there is an associated Boolean function
$f_u : \{-1,1\}^{R_1} \to \{-1,1\}$, and for each vertex $v \in V$, there is an associated Boolean function
$g_v : \{-1,1\}^{R_2} \to \{-1,1\}$. The proof contains the truth tables of all these Boolean functions,
and the length of the proof is therefore $|U| 2^{R_1} + |V| 2^{R_2}$. From now on we always view $-1$ as
True and $1$ as False.

The verifier checks the proof by the following procedure:

•  Pick an edge $e = (u, v)$ from distribution $P$.

•  Generate a triple $(x, y, z)$ from some distribution $\mathcal{T}_e$ on $\{-1,1\}^{R_1} \times \{-1,1\}^{R_2} \times \{-1,1\}^{R_2}$
($\mathcal{T}_e$ is specified later).

•  Accept if $\mathrm{NTW}(f_u(x), g_v(y), g_v(z)) = 1$.

Folding  The prover can write the constant “1” function for every $f_u, g_v$, and such a
proof always passes. To address this, the standard “folding trick” [19] is used in our
system. For example, for a query $x = (x_1, x_2, \ldots, x_{R_1})$ on $f_u$, the verifier always uses the value of
$\mathrm{sgn}(x_1) f_u(1, \mathrm{sgn}(x_1) x_2, \ldots, \mathrm{sgn}(x_1) x_{R_1})$ (instead of $f_u(x)$). A similar strategy is applied to
queries $y$ and $z$ on $g_v$. As a result, we can assume that all the functions $f_u, g_v$ are odd.
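The folding trick can be illustrated with a small sketch. It assumes a function is represented as a Python callable on tuples of $\pm 1$ values; the helper names `fold` and `is_odd` are hypothetical and only serve the illustration.

```python
from itertools import product

def fold(f):
    """Folded version of f: read f at (1, x1*x2, ..., x1*xn) and multiply by x1."""
    def folded(x):
        s = x[0]  # sgn(x_1) equals x_1 because x_1 is +1 or -1
        return s * f(tuple(s * t for t in x))
    return folded

def is_odd(f, n):
    """Check f(-x) == -f(x) on every point of {-1, 1}^n."""
    return all(f(tuple(-t for t in x)) == -f(x)
               for x in product((-1, 1), repeat=n))
```

For instance, folding the constant-1 function yields an odd function (the first dictator), while a dictator function is left unchanged by folding.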

For the above PCP system, we will show:

1.  If $\mathrm{opt}(\mathcal{L}) = 1$, there is a proof that passes the test with probability 1. (completeness)

2.  For any $\epsilon > 0$, if there exists a proof that passes with probability at least $5/8 + \epsilon$, then
$\mathrm{opt}(\mathcal{L}) \ge \eta$, where $\eta > 0$ is some constant depending only on $\epsilon$ and $d$. (soundness)

Assuming the d-to-1 Conjecture, such a proof system shows that $\mathrm{NP} \subseteq \mathrm{naPCP}_{1,5/8+\epsilon}(O(\log n), 3)$
and gives a $5/8 + \epsilon$ hardness result for approximating satisfiable Max NTW.


5.3  The  Test  and  the  Analysis 

Given the above PCP reduction framework, the remainder of this chapter constructs the
distribution $\mathcal{T}_e$ and analyzes the completeness and soundness of the associated verifier. The
reader is assumed to be familiar with the basics of Fourier analysis of Boolean functions;
see, e.g., [134]. As a reminder, our default representation for bits will be $+1$ and $-1$ rather
than 0 and 1. We use $\hat f(S)$ to denote the Fourier coefficient of function $f$ on set $S$.

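5.3.1  Idea of Constructing $\mathcal{T}_e$

Recall that the verifier first picks an edge $e = (u, v)$. Then it generates $(x, y, z)$ from the
distribution $\mathcal{T}_e$ and accepts if $\mathrm{NTW}(f_u(x), g_v(y), g_v(z)) = 1$. In this section we define the
distribution $\mathcal{T}_e$.

For the picked edge $e$, we write $d_i = |\pi^{-1}(i)|$ for $i = 1, 2, \ldots, R_1$. By the property of d-to-1
projections, we know $1 \le d_i \le d$.

We express $f_u : \{-1,1\}^{R_1} \to \{-1,1\}$ as

$f_u : \mathcal{X}^1 \times \mathcal{X}^2 \times \cdots \times \mathcal{X}^{R_1} \to \{-1,1\},$

where each $\mathcal{X}^i = \{-1,1\}^{\{i\}}$ (a slightly complicated way to write $\{-1,1\}$), and
$g_v : \{-1,1\}^{R_2} \to \{-1,1\}$ as

$g_v : \mathcal{Y}^1 \times \mathcal{Y}^2 \times \cdots \times \mathcal{Y}^{R_1} \to \{-1,1\},$

where each $\mathcal{Y}^i = \{-1,1\}^{\pi^{-1}(i)}$. We also write $\mathcal{Z}^i = \{-1,1\}^{\pi^{-1}(i)}$.

We construct $\mathcal{T}_e$ as a product distribution:
$(\prod_{i=1}^{R_1} \mathcal{X}^i \times \prod_{i=1}^{R_1} \mathcal{Y}^i \times \prod_{i=1}^{R_1} \mathcal{Z}^i, \mathcal{T}_e) = \prod_{i=1}^{R_1} (\mathcal{X}^i \times \mathcal{Y}^i \times \mathcal{Z}^i, \mathcal{T}_e^i)$.
To define each $\mathcal{T}_e^i$, we start by defining some general distributions on
$\{-1,1\} \times \{-1,1\}^m \times \{-1,1\}^m$. We denote $\{-1,1\} \times \{-1,1\}^m \times \{-1,1\}^m$ by $\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$ here.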

Definition 5.3.1.  Define distribution $\mathcal{H}(m)$ generating $(x, y_1, y_2, \ldots, y_m, z_1, z_2, \ldots, z_m)$ as
follows: first $x, y_1, y_2, \ldots, y_m$ are generated as independent random bits, and then for each
$1 \le i \le m$, $z_i$ is set to be $-y_i x$.

By definition, $x$ and $y_1, \ldots, y_m$ are independent. In addition, the distribution is the same
if we first generate $(x, z_1, \ldots, z_m)$ and then set $y_i = -x z_i$. Therefore $x$ and $z_1, \ldots, z_m$ are also
independent.

The distribution $\mathcal{H}(m)$ is also the basis of Håstad's construction of the XOR3 verifier.
Håstad's verifier checks $\mathrm{XOR3}(f_u(x), g_v(y), g_v(z)) = 0$, where $(x, y, z)$ is first generated by
$\mathcal{T}_e$ with $\mathcal{T}_e^i = \mathcal{H}(d_i)$, and then each bit in $(x, y, z)$ is independently reset to a random bit
with probability $\delta$. Such a PCP system has near-perfect completeness and
soundness $1/2 + \delta$. The random noise is added to each bit to make sure that big parity
functions pass with small probability.

We have a similar situation here. $\mathcal{H}(m)$ is good for us as $(x, y_i, z_i)$ can only be one of
$(1, 1, -1)$, $(1, -1, 1)$, $(-1, 1, 1)$, $(-1, -1, -1)$, and these triples are all in the support of NTW. We
cannot add random noise to $\mathcal{H}(m)$, though, as we need perfect completeness. Notice
that $(1, 1, 1)$ is also in the support of NTW. We therefore tweak $\mathcal{H}(m)$ by including
$(1, 1, 1)$ as a possible value for $(x, y_i, z_i)$.
Definition 5.3.2.  Define distribution $\mathcal{N}(m)$ generating $(x, y_1, y_2, \ldots, y_m, z_1, z_2, \ldots, z_m)$ as
follows. First, we pick a random integer $k$ from 1 to $m$. Next, we generate
$x, y_1, \ldots, y_{k-1}, y_{k+1}, \ldots, y_m$ as independent random bits. Last, we set $y_k, z_k$ to be equal to $x$,
and for any $i \in [m]$, $i \ne k$, we set $z_i$ to be equal to $-y_i x$. Define distribution
$\mathcal{H}_\delta(m) = (1 - \delta)\mathcal{H}(m) + \delta\mathcal{N}(m)$; i.e., $\mathcal{H}_\delta(m)$ generates $(x, y_1, \ldots, y_m, z_1, \ldots, z_m)$ by $\mathcal{H}(m)$
with probability $1 - \delta$ and by $\mathcal{N}(m)$ with probability $\delta$.

It is easy to check that the marginal distributions of $\mathcal{H}(m)$, $\mathcal{N}(m)$ and $\mathcal{H}_\delta(m)$ on $\mathcal{X}$, $\mathcal{Y}$
and $\mathcal{Z}$ are all uniform.
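The two claims above (every coordinate triple lies in the support of NTW, and all three marginals are uniform) can be checked by exhaustive enumeration for small $m$. The sketch below is only an illustration: the function names are hypothetical, the distributions are written out as I read Definitions 5.3.1 and 5.3.2, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

def dist_H(m):
    """H(m): x, y_1..y_m independent uniform bits, z_i = -y_i * x."""
    out = {}
    for x, *y in product((-1, 1), repeat=m + 1):
        key = (x, tuple(y), tuple(-yi * x for yi in y))
        out[key] = out.get(key, 0) + Fraction(1, 2 ** (m + 1))
    return out

def dist_N(m):
    """N(m): uniform k; y_k = z_k = x; other y_i free with z_i = -y_i * x."""
    out = {}
    for k in range(m):
        for x, *rest in product((-1, 1), repeat=m):
            y = rest[:k] + [x] + rest[k:]
            z = tuple(x if i == k else -y[i] * x for i in range(m))
            key = (x, tuple(y), z)
            out[key] = out.get(key, 0) + Fraction(1, m * 2 ** m)
    return out

def mix(d1, d2, delta):
    """The mixture (1 - delta) * d1 + delta * d2."""
    return {k: (1 - delta) * d1.get(k, 0) + delta * d2.get(k, 0)
            for k in set(d1) | set(d2)}

def ntw(a, b, c):
    """NTW accepts unless exactly two inputs are True (-1 encodes True)."""
    return [a, b, c].count(-1) != 2

def marginal_is_uniform(dist, part, m):
    """part = 0, 1, 2 selects the x, y, or z component of each outcome."""
    marg = {}
    for key, p in dist.items():
        marg[key[part]] = marg.get(key[part], 0) + p
    size = 2 if part == 0 else 2 ** m
    return len(marg) == size and all(p == Fraction(1, size) for p in marg.values())
```

Calling `mix(dist_H(m), dist_N(m), Fraction(1, 8))` gives $\mathcal{H}_\delta(m)$ for $\delta = 1/8$; the value of $\delta$ here is arbitrary and chosen only for the check.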

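We are now ready to define $\mathcal{T}_e$:

Definition 5.3.3.  We set $\mathcal{T}_e = \prod_{i=1}^{R_1} \mathcal{T}_e^i$, where each $\mathcal{T}_e^i$ is set to be $\mathcal{H}_\delta(d_i)$ for
$i = 1, 2, \ldots, R_1$ and $\delta = \epsilon^2/2$.

To analyze $\mathcal{T}_e$, we also need to define several other distributions.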

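Definition 5.3.4.  Define distribution $\mathcal{I}(m)$ generating $(x, y_1, y_2, \ldots, y_m, z_1, z_2, \ldots, z_m)$ as follows:
first $y_1, y_2, \ldots, y_m, z_1, z_2, \ldots, z_m$ are generated by their marginal distribution on $\mathcal{H}(m)$, and $x$ is
generated as a random bit independent of $(y_1, y_2, \ldots, y_m, z_1, \ldots, z_m)$.

By definition, $\mathcal{I}(m)$ and $\mathcal{H}(m)$ have the same marginal distribution on $\mathcal{Y} \times \mathcal{Z}$. Also,
they have the same marginal distribution on $\mathcal{X} \times \mathcal{Y}$ and $\mathcal{X} \times \mathcal{Z}$, as $x, y_1, \ldots, y_m$ and $x, z_1, \ldots, z_m$
are independent in both $\mathcal{I}(m)$ and $\mathcal{H}(m)$. In addition, it is easy to check that $\mathcal{N}(m)$ and $\mathcal{I}(m)$
have the same “1-wise” marginal distribution (the uniform distribution) on $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$.

We also add the “tweak” $\mathcal{N}(m)$ to $\mathcal{I}(m)$ to define a new distribution:

Definition 5.3.5.  Define distribution $\mathcal{I}_\delta(m)$ to be $\mathcal{I}_\delta(m) = (1 - \delta)\mathcal{I}(m) + \delta\mathcal{N}(m)$.

It is easy to see that $\mathcal{H}_\delta(m)$ and $\mathcal{I}_\delta(m)$ also have the same “1-wise” and “2-wise” correlations;
i.e., $\mathcal{H}_\delta(m)$ and $\mathcal{I}_\delta(m)$ have the same marginal distributions on $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}$, $\mathcal{X} \times \mathcal{Y}$, $\mathcal{X} \times \mathcal{Z}$
and $\mathcal{Y} \times \mathcal{Z}$. Their “1-wise” marginal distributions are all uniform. The 3-wise correlations
of $\mathcal{H}_\delta(m)$ and $\mathcal{I}_\delta(m)$ are different, though. The following lemma describes the difference.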

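Lemma 5.3.6.  For any functions $f : \mathcal{X} \to \mathbb{R}$, $g : \mathcal{Y} \to \mathbb{R}$, $h : \mathcal{Z} \to \mathbb{R}$,

$\mathbf{E}_{\mathcal{H}_\delta(m)}[f(x) g(y) h(z)] - \mathbf{E}_{\mathcal{I}_\delta(m)}[f(x) g(y) h(z)] = -\sum_{|S| \text{ odd},\, S \subseteq [m]} (1 - \delta)\, \hat f(\{1\})\, \hat g(S)\, \hat h(S).$


Proof.  Recall that $\chi_S$ is defined to be the parity function on set $S$. By writing each function
in its Fourier expansion, we have

$\mathrm{LHS} = \sum_{U \subseteq [1],\, V \subseteq [m],\, W \subseteq [m]} \hat f(U)\, \hat g(V)\, \hat h(W) \left( \mathbf{E}_{\mathcal{H}_\delta(m)}[\chi_U(x) \chi_V(y) \chi_W(z)] - \mathbf{E}_{\mathcal{I}_\delta(m)}[\chi_U(x) \chi_V(y) \chi_W(z)] \right).$  (5.1)

By the definition of $\mathcal{H}_\delta(m)$ and $\mathcal{I}_\delta(m)$, we know

$\mathbf{E}_{\mathcal{H}_\delta(m)}[\chi_U(x) \chi_V(y) \chi_W(z)] - \mathbf{E}_{\mathcal{I}_\delta(m)}[\chi_U(x) \chi_V(y) \chi_W(z)]$
$\qquad = (1 - \delta) \left( \mathbf{E}_{\mathcal{H}(m)}[\chi_U(x) \chi_V(y) \chi_W(z)] - \mathbf{E}_{\mathcal{I}(m)}[\chi_U(x) \chi_V(y) \chi_W(z)] \right).$  (5.2)

Notice that $\mathcal{H}(m)$ and $\mathcal{I}(m)$ have the same marginal distribution on $\mathcal{Y} \times \mathcal{Z}$. Therefore,
to make (5.2) nonzero, $U$ must be nonempty (and therefore must be $\{1\}$). When
$U = \{1\}$ we have that $\mathbf{E}_{\mathcal{I}(m)}[\chi_U(x) \chi_V(y) \chi_W(z)] = \mathbf{E}_{\mathcal{I}(m)}[x]\, \mathbf{E}_{\mathcal{I}(m)}[\chi_V(y) \chi_W(z)] = 0$. Therefore,
$\mathbf{E}_{\mathcal{H}(m)}[x \chi_V(y) \chi_W(z)]$ must be nonzero to make (5.2) nonzero. It is not hard to see that
this happens only when $V, W$ are the same set with odd cardinality, in which case the
expectation of $x \chi_V(y) \chi_W(z)$ is $-1$ (since $y_i z_i = -x$ for every $i$). Therefore,

$(5.1) = \sum_{|S| \text{ odd},\, S \subseteq [m]} (1 - \delta)\, \hat f(\{1\})\, \hat g(S)\, \hat h(S)\, \mathbf{E}_{\mathcal{H}(m)}[x \chi_S(y) \chi_S(z)] = \mathrm{RHS}.$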

□ 
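The key step in the proof, namely which pairs $(V, W)$ give a nonzero value of $\mathbf{E}_{\mathcal{H}(m)}[x \chi_V(y) \chi_W(z)]$, can be checked by enumeration for small $m$. The sketch below is an illustration only; it assumes the convention $z_i = -y_i x$ of Definition 5.3.1, and the helper names are hypothetical.

```python
from itertools import product, combinations

def chi(S, v):
    """Parity function chi_S(v): the product of v_i over i in S."""
    out = 1
    for i in S:
        out *= v[i]
    return out

def cubic_moment(m, V, W):
    """E over H(m) of x * chi_V(y) * chi_W(z), computed by full enumeration."""
    total = 0
    for x, *y in product((-1, 1), repeat=m + 1):
        z = [-yi * x for yi in y]  # Definition 5.3.1: z_i = -y_i * x
        total += x * chi(V, y) * chi(W, z)
    return total / 2 ** (m + 1)

def subsets(m):
    """All subsets of {0, ..., m-1}, ordered by size."""
    return [S for r in range(m + 1) for S in combinations(range(m), r)]
```

Looping over `subsets(m)` for both `V` and `W` lists every moment; under the stated convention the only nonzero ones occur for `V == W` of odd size.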


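Recall the definition of correlation between probability spaces.

Definition 5.3.7.  Let $(\Omega \times \Psi, \mu)$ be a correlated finite probability space. Define the correlation
between $\Omega$ and $\Psi$ to be

$\rho(\Omega, \Psi; \mu) = \sup\{\mathrm{Cov}[f, g] : f : \Omega \to \mathbb{R},\ g : \Psi \to \mathbb{R},\ \mathrm{Var}[f] = \mathrm{Var}[g] = 1\}.$

The conditional operator $U_\mu$ associated with $\mu$ is a mapping from the function space
$\{f \mid f : \Psi \to \mathbb{R}\}$ to $\{g \mid g : \Omega \to \mathbb{R}\}$ defined as follows: for $f : \Psi \to \mathbb{R}$, any $x \in \Omega$ and a random
variable $(X, Y)$ drawn from $\mu$, $U_\mu f(x) = \mathbf{E}[f(Y) \mid X = x]$.

Since $\mathrm{Cov}[f, g] = \mathbf{E}[(f - \mathbf{E}[f])(g - \mathbf{E}[g])]$, we can assume without loss of generality that
$\mathbf{E}[f] = 0$, $\mathbf{E}[g] = 0$ when calculating the correlation between the two spaces; i.e.,

$\rho(\Omega, \Psi; \mu) = \sup\{\mathbf{E}[fg] : f : \Omega \to \mathbb{R},\ g : \Psi \to \mathbb{R},\ \mathbf{E}[f] = 0,\ \mathbf{E}[g] = 0,\ \mathrm{Var}[f] = \mathrm{Var}[g] = 1\}.$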


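For the distributions defined above, we have the following properties, which are proved in
Section 5.5.

Lemma 5.3.8.  $\rho(\mathcal{X} \times \mathcal{Y}, \mathcal{Z}; \mathcal{H}_\delta(m)) \le 1 - \frac{\delta^2}{2^{2d+1} d^2}$.

Lemma 5.3.9.  $\rho(\mathcal{X}, \mathcal{Y}; \mathcal{H}_\delta(m)) \le \delta$.

Lemma 5.3.10.  $\rho(\mathcal{X}, \mathcal{Y} \times \mathcal{Z}; \mathcal{I}_\delta(m)) \le \sqrt{\delta}$.

We comment that if we did not add the “tweak distribution” $\mathcal{N}$ to $\mathcal{H}$, we would have
that $\rho(\mathcal{X} \times \mathcal{Y}, \mathcal{Z}; \mathcal{H}(m)) = 1$. As for $\mathcal{H}_\delta$, we still have $\rho(\mathcal{X}, \mathcal{Y} \times \mathcal{Z}; \mathcal{H}_\delta(m)) = 1$, and this is
a tricky part of our analysis.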
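As an informal sanity check of the bound $\rho(\mathcal{X}, \mathcal{Y}; \mathcal{H}_\delta(m)) \le \delta$, the correlation can be computed exactly for small $m$. The sketch relies on one assumption that is not taken from the text: since $\mathcal{X}$ is a single uniform bit, the only mean-zero, unit-variance functions on it are $\pm x$, so the correlation equals the $L_2$ norm of $\mathbf{E}[x \mid y]$. Function names are hypothetical, and the $(x, y)$-marginal is written out as I read Definitions 5.3.1 and 5.3.2.

```python
from fractions import Fraction
from itertools import product
import math

def xy_marginal(m, delta):
    """Joint law of (x, y) under H_delta(m) = (1 - delta) H(m) + delta N(m)."""
    law = {}
    for x, *y in product((-1, 1), repeat=m + 1):
        p = (1 - delta) * Fraction(1, 2 ** (m + 1))   # H(m): x independent of y
        hits = sum(1 for yk in y if yk == x)          # N(m): a random y_k copies x
        p += delta * Fraction(hits, m) * Fraction(1, 2 ** m)
        law[(x, tuple(y))] = p
    return law

def rho_x_y(m, delta):
    """Correlation between the single bit x and y: sqrt(E_y[(E[x | y])^2])."""
    law = xy_marginal(m, delta)
    second_moment = Fraction(0)
    for y in product((-1, 1), repeat=m):
        py = law[(1, y)] + law[(-1, y)]
        cond_mean = (law[(1, y)] - law[(-1, y)]) / py
        second_moment += py * cond_mean ** 2
    return math.sqrt(second_moment)
```

In this computation $\mathbf{E}[x \mid y]$ works out to $\delta$ times the average of the $y_i$, so the value is $\delta/\sqrt{m}$, consistent with the stated bound.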

In [115] (Proposition 2.13), Mossel proved that the correlation of product spaces is
determined by the maximum correlation among the individual correlated spaces:

Proposition 5.3.11.  For $i = 1, 2, \ldots, n$, let $(\Omega_i \times \Psi_i, \mu_i)$ be a finite correlated probability space.
We define the product probability space $(\Omega \times \Psi, \mu) = \prod_{i=1}^{n} (\Omega_i \times \Psi_i, \mu_i)$. Then:

$\rho(\Omega, \Psi; \mu) = \max_i \rho(\Omega_i, \Psi_i; \mu_i).$


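By applying Proposition 5.3.11, we know:

Lemma 5.3.12.

$\rho\left( \prod_{i=1}^{R_1} \mathcal{X}^i \times \prod_{i=1}^{R_1} \mathcal{Y}^i,\ \prod_{i=1}^{R_1} \mathcal{Z}^i;\ \mathcal{T}_e \right) \le 1 - \frac{\delta^2}{2^{2d+1} d^2},$

$\rho\left( \prod_{i=1}^{R_1} \mathcal{X}^i,\ \prod_{i=1}^{R_1} \mathcal{Y}^i;\ \mathcal{T}_e \right) \le \delta,$

$\rho\left( \prod_{i=1}^{R_1} \mathcal{X}^i,\ \prod_{i=1}^{R_1} \mathcal{Y}^i \times \prod_{i=1}^{R_1} \mathcal{Z}^i;\ \prod_{i=1}^{R_1} \mathcal{I}_\delta(d_i) \right) \le \sqrt{\delta}.$

5.3.2  Analysis of the Verifier

In this section, we analyze the completeness and soundness of our verifier.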


Completeness Analysis  If $\mathrm{val}(L) = 1$ for some labelling $L$, we can simply take $f_u$ ($u \in U$)
to be the $L(u)$-th dictator function $\chi_{L(u)}$ and $g_v$ ($v \in V$) to be $\chi_{L(v)}$. By the definition of
$\mathcal{T}_e$, for any edge $(u, v)$, we know that $(x_{L(u)}, y_{L(v)}, z_{L(v)})$ is always in the support of NTW. Also,
the dictator function is odd and is not changed by the folding procedure. Such a proof
passes with probability 1.


Soundness Analysis  For any $\epsilon > 0$, we show that if some proof passes the test with probability
more than $5/8 + \epsilon$, then $\mathrm{opt}(\mathcal{L}) \ge \eta$, where $\eta > 0$ is some constant depending only
on $\epsilon$ and $d$.

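First let us arithmetize the probability that the proof passes. We have

$\Pr[\mathrm{NTW}(f_u(x), g_v(y), g_v(z)) = 1] = \mathbf{E}_{e=(u,v) \sim P,\ \mathcal{T}_e} \Big[ \tfrac58 + \tfrac18 (f_u(x) + g_v(y) + g_v(z))$
$\qquad + \tfrac18 (f_u(x) g_v(y) + g_v(y) g_v(z) + f_u(x) g_v(z)) - \tfrac38 f_u(x) g_v(y) g_v(z) \Big].$  (5.3)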
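The multilinear expansion of the NTW predicate used in (5.3) can be verified on all eight inputs. The sketch below is only a check of that identity, with $-1$ encoding True as elsewhere in this chapter; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def ntw(a, b, c):
    """1 if NTW accepts (not exactly two Trues, with -1 encoding True), else 0."""
    return 0 if [a, b, c].count(-1) == 2 else 1

def ntw_polynomial(a, b, c):
    """The multilinear polynomial 5/8 + (a+b+c)/8 + (ab+bc+ac)/8 - 3abc/8."""
    return (Fraction(5, 8)
            + Fraction(1, 8) * (a + b + c)
            + Fraction(1, 8) * (a * b + b * c + a * c)
            - Fraction(3, 8) * a * b * c)
```

Averaging `ntw_polynomial` over the eight points of $\{-1, 1\}^3$ gives 5/8, the random assignment threshold mentioned in Section 5.1.3.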


By the folding mentioned in Section 5.2.4, we know all the $f_u, g_v$ are odd. Also notice
that the 1-wise marginal distributions of $\mathcal{T}_e$ are all uniform; therefore

$\mathbf{E}_{e=(u,v) \sim P,\ \mathcal{T}_e} \left[ \tfrac18 (f_u(x) + g_v(y) + g_v(z)) \right] = 0.$

In the following Theorems 5.3.13, 5.3.14 and 5.3.26, we analyze the terms
$\mathbf{E}[f_u(x) g_v(y) + f_u(x) g_v(z)]$, $\mathbf{E}[g_v(y) g_v(z)]$ and $\mathbf{E}[f_u(x) g_v(y) g_v(z)]$ respectively.

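Theorem 5.3.13.  For any odd Boolean functions $f : \{-1,1\}^{R_1} \to \{-1,1\}$ and
$g : \{-1,1\}^{R_2} \to \{-1,1\}$,

$\mathbf{E}_{x, y \sim \mathcal{T}_e}[f(x) g(y)] \le \delta.$


Proof.  This follows directly from Lemma 5.3.12, which gives
$\rho(\prod_{i=1}^{R_1} \mathcal{X}^i, \prod_{i=1}^{R_1} \mathcal{Y}^i; \mathcal{T}_e) \le \delta$. By definition, any odd Boolean functions $f, g$ (which
therefore have mean 0 and variance 1) can have correlation at most $\delta$.  □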

By applying Theorem 5.3.13 and noticing the symmetry between $y$ and $z$, we have

$\mathbf{E}_{e=(u,v) \sim P,\ \mathcal{T}_e} \left[ \tfrac18 (f_u(x) g_v(y) + f_u(x) g_v(z)) \right] \le \delta/4.$

In the next section we analyze the term $g_v(y) g_v(z)$.

Analyzing $g_v(y) g_v(z)$

Theorem 5.3.14.  For any odd Boolean function $g : \{-1,1\}^{R_2} \to \{-1,1\}$, we have

$\mathbf{E}_{y, z \sim \mathcal{T}_e}[g(y) g(z)] \le \delta.$

Proof.  It can be checked that $\rho(\mathcal{Y}^i, \mathcal{Z}^i; \mathcal{H}_\delta(d_i)) = 1 - \delta$. We cannot use the same simple
trick as in Theorem 5.3.13. However, the fact that $g$ is odd makes it possible for us to
bound $\mathbf{E}_{y, z \sim \mathcal{T}_e}[g(y) g(z)]$.

First we need to define the matrix form of a distribution on $\{-1,1\}^m \times \{-1,1\}^m$ with respect
to the Fourier basis.

Definition 5.3.15.  Suppose $\mathcal{D}$ is a distribution on $\{-1,1\}^m \times \{-1,1\}^m$. A $2^m \times 2^m$ matrix
$M(\mathcal{D})$ is defined as follows: we use the subsets of $[m]$ to index the rows and columns,
numbered from 1 to $2^m$. For any $S \subseteq [m]$, $T \subseteq [m]$, the entry $M(\mathcal{D})_{S,T}$ in row $S$ and
column $T$ is $\mathbf{E}_{(y,z) \sim \mathcal{D}}[\chi_S(y) \chi_T(z)]$.

We can also identify a function $g : \{-1,1\}^m \to \mathbb{R}$ with a column vector in $\mathbb{R}^{2^m}$ that contains
the entire collection of $g$'s Fourier coefficients. The Fourier coefficients are arranged in the
same order as their associated sets in Definition 5.3.15.

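For example, when $m = 2$, the subsets of $\{1, 2\}$ are ordered as $\emptyset, \{1\}, \{2\}, \{1, 2\}$. Suppose the
distribution generates $(y_1, y_2, z_1, z_2) \in \{-1,1\}^2 \times \{-1,1\}^2$ as follows: $y_1, y_2$ are generated as
independent random bits and $z_i = y_i$ for $i = 1, 2$. Then such a distribution has the following
matrix form:

$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$

And we write down the vector form of a function $g$ as:

$g = \begin{pmatrix} \hat g(\emptyset) \\ \hat g(\{1\}) \\ \hat g(\{2\}) \\ \hat g(\{1, 2\}) \end{pmatrix}.$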

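With this matrix notation, we can write the expectation of the product of two functions as a product of vectors and a matrix:

\[ \mathbf{E}_{(y,z)\sim\mathcal{D}}[f(y)g(z)] = \sum_{S,T\subseteq[m]} \hat f(S)\hat g(T)\,\mathbf{E}_{(y,z)\sim\mathcal{D}}[\chi_S(y)\chi_T(z)] = f^T M(\mathcal{D})\, g. \]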
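The matrix form can be sanity-checked numerically on the $m = 2$ example. The following is a minimal sketch, not part of the original proof; the helper names (`subsets`, `chi`, `fourier_vector`, `matrix_form`) and the two test functions `f`, `g` are illustrative assumptions, and coordinates are 0-indexed.

```python
from itertools import combinations, product
from math import prod

def subsets(m):
    """Subsets of {0, ..., m-1}, ordered by size, then lexicographically."""
    return [s for k in range(m + 1) for s in combinations(range(m), k)]

def chi(S, y):
    """Character chi_S(y) = product of y_i over i in S (empty product is 1)."""
    return prod(y[i] for i in S)

def fourier_vector(f, m):
    """Fourier coefficients of f, listed in the order of subsets(m)."""
    points = list(product([-1, 1], repeat=m))
    return [sum(f(y) * chi(S, y) for y in points) / len(points)
            for S in subsets(m)]

def matrix_form(dist, m):
    """M(D)[S][T] = E_{(y,z)~D}[chi_S(y) chi_T(z)]; dist maps (y, z) -> probability."""
    subs = subsets(m)
    return [[sum(p * chi(S, y) * chi(T, z) for (y, z), p in dist.items())
             for T in subs] for S in subs]

m = 2
# Identity distribution from the example: y uniform, z = y.
identity_dist = {(y, y): 0.25 for y in product([-1, 1], repeat=m)}
M = matrix_form(identity_dist, m)

# Two arbitrary (hypothetical) real-valued functions on {-1,1}^2.
f = lambda y: max(y)
g = lambda z: z[0] * z[1] + z[0]
f_hat, g_hat = fourier_vector(f, m), fourier_vector(g, m)

# f^T M g versus the expectation computed directly from the distribution.
bilinear = sum(f_hat[i] * M[i][j] * g_hat[j] for i in range(4) for j in range(4))
direct = sum(p * f(y) * g(z) for (y, z), p in identity_dist.items())
```

Here `M` comes out as the $4 \times 4$ identity matrix, and `bilinear` agrees with `direct` up to floating-point error.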

The following lemma is easy to check.

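Lemma 5.3.16. i) Let $\mathcal{D}_1$ and $\mathcal{D}_2$ be two distributions on $\{-1,1\}^m \times \{-1,1\}^m$. If $\mathcal{D} = c\mathcal{D}_1 + (1-c)\mathcal{D}_2$, then $M(\mathcal{D}) = cM(\mathcal{D}_1) + (1-c)M(\mathcal{D}_2)$.

ii) Let $\mathcal{D}_1$ and $\mathcal{D}_2$ be two distributions on $\{-1,1\}^{m_1} \times \{-1,1\}^{m_1}$ and $\{-1,1\}^{m_2} \times \{-1,1\}^{m_2}$ respectively. If $\mathcal{D} = \mathcal{D}_1 \times \mathcal{D}_2$, then $M(\mathcal{D}) = M(\mathcal{D}_1) \otimes M(\mathcal{D}_2)$.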
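Part ii) depends on indexing the subsets of $[m_1 + m_2]$ compatibly with the Kronecker product. The sketch below is an illustration under that assumption: subsets are encoded as bitmasks with the coordinates of $\mathcal{D}_1$ in the high bits, and the two small distributions `d1` and `d2` are hypothetical examples chosen only for the check.

```python
from itertools import product

def chi(mask, y):
    """Character of the subset encoded by mask; bit (m-1-j) of mask <-> coordinate j."""
    m, out = len(y), 1
    for j in range(m):
        if mask >> (m - 1 - j) & 1:
            out *= y[j]
    return out

def matrix_form(dist):
    """M(D)[S][T] = E[chi_S(y) chi_T(z)], with subsets indexed by bitmask."""
    m = len(next(iter(dist))[0])
    size = 1 << m
    return [[sum(p * chi(S, y) * chi(T, z) for (y, z), p in dist.items())
             for T in range(size)] for S in range(size)]

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def product_dist(d1, d2):
    """Independent product: concatenate the y parts and the z parts."""
    return {(y1 + y2, z1 + z2): p1 * p2
            for (y1, z1), p1 in d1.items() for (y2, z2), p2 in d2.items()}

# d1: one bit, z equals y with probability 3/4 and -y otherwise (hypothetical).
d1 = {((1,), (1,)): 0.375, ((1,), (-1,)): 0.125,
      ((-1,), (-1,)): 0.375, ((-1,), (1,)): 0.125}
# d2: identity distribution on two bits.
d2 = {(y, y): 0.25 for y in product([-1, 1], repeat=2)}

lhs = matrix_form(product_dist(d1, d2))
rhs = kron(matrix_form(d1), matrix_form(d2))
max_gap = max(abs(a - b) for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
```

With this indexing, `max_gap` is zero up to rounding; a different subset ordering would give the same matrix up to a simultaneous permutation of rows and columns.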
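We define the identity distribution $\mathcal{I}(m)$ on $\{-1,1\}^m \times \{-1,1\}^m$ as follows: $y_1,\ldots,y_m$ are first generated as independent random bits and $z_i = y_i$ for every $i$. It is easy to check that $M(\mathcal{I}(m))$ is the identity matrix.

Now we are ready to prove Theorem 5.3.14. First let us write $\mathbf{E}_{y,z\sim\mathcal{T}_e}[g(y)g(z)]$ as the product of the vector form of $g$ and the matrix $M(\mathcal{T}_e)$:

\[ \mathbf{E}_{y,z\sim\mathcal{T}_e}[g(y)g(z)] = g^T M(\mathcal{T}_e)\, g = g^T \Big(\bigotimes_{i=1}^{R_1} M(\mathcal{H}_\delta(d_i))\Big) g. \]

Define the distribution $\mathcal{I}_\delta(m) = \delta\,\mathcal{I}(m) + (1-\delta)\,\mathcal{H}(m)$. We will show

\[ g^T \Big(\bigotimes_{i=1}^{R_1} M(\mathcal{H}_\delta(d_i))\Big) g \le g^T \Big(\bigotimes_{i=1}^{R_1} M(\mathcal{I}_\delta(d_i))\Big) g. \]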

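To see this, we first need to prove the following lemma:

Lemma 5.3.17. $M(\mathcal{I}_\delta(m))$, $M(\mathcal{I}_\delta(m)) - M(\mathcal{H}_\delta(m))$ and $M(\mathcal{I}_\delta(m)) + M(\mathcal{H}_\delta(m))$ are all positive matrices.

Proof. 1) To show $M(\mathcal{I}_\delta(m))$ is positive: Recall that $\mathcal{I}_\delta(m) = \delta\,\mathcal{I}(m) + (1-\delta)\,\mathcal{H}(m)$. $M(\mathcal{I}(m))$ is the identity matrix, which is positive. It is easy to check that $M(\mathcal{H}(m))$ is a diagonal matrix with the following diagonal elements: for any odd-cardinality set $S \subseteq [m]$, $M(\mathcal{H}(m))_{S,S} = 0$, and for any even-cardinality set $S \subseteq [m]$, $M(\mathcal{H}(m))_{S,S} = 1$. So $M(\mathcal{H}(m))$ is also positive. Therefore, $M(\mathcal{I}_\delta(m))$ is positive.

2) To show $M(\mathcal{I}_\delta(m)) - M(\mathcal{H}_\delta(m))$ is positive: Since $M(\mathcal{I}_\delta(m)) - M(\mathcal{H}_\delta(m)) = \delta\,(M(\mathcal{I}(m)) - M(\mathcal{N}(m)))$, we only need to show that $M(\mathcal{I}(m)) - M(\mathcal{N}(m))$ is a positive matrix.

Notice that for any function $h : \{-1,1\}^m \to \mathbb{R}$,

\[ h^T M(\mathcal{N}(m))\, h = \mathbf{E}_{\mathcal{N}(m)}[h(y)h(z)] \le \mathbf{E}_{\mathcal{N}(m)}\Big[\frac{h(y)^2 + h(z)^2}{2}\Big]. \]

Notice that $\mathcal{N}(m)$ and $\mathcal{I}(m)$ have the same marginal distribution (the uniform distribution) on $y$ and $z$. Then

\[ \mathbf{E}_{\mathcal{N}(m)}\Big[\frac{h(y)^2 + h(z)^2}{2}\Big] = \mathbf{E}_{\mathcal{N}(m)}[h(y)^2] = \mathbf{E}_{\mathcal{I}(m)}[h(y)^2] = h^T M(\mathcal{I}(m))\, h. \]

This implies that for any $h$, $h^T (M(\mathcal{I}(m)) - M(\mathcal{N}(m)))\, h \ge 0$. Therefore, $M(\mathcal{I}(m)) - M(\mathcal{N}(m))$ is a positive matrix.

3) To show $M(\mathcal{I}_\delta(m)) + M(\mathcal{H}_\delta(m))$ is positive: We know $M(\mathcal{I}_\delta(m)) + M(\mathcal{H}_\delta(m)) = 2(1-\delta)M(\mathcal{H}(m)) + \delta\,(M(\mathcal{I}(m)) + M(\mathcal{N}(m)))$. We already know $M(\mathcal{H}(m))$ is positive. It remains to prove that $M(\mathcal{I}(m)) + M(\mathcal{N}(m))$ is positive. Notice that

\[ -h^T M(\mathcal{N}(m))\, h = \mathbf{E}_{\mathcal{N}(m)}[-h(y)h(z)] \le \mathbf{E}_{\mathcal{N}(m)}\Big[\frac{h(y)^2 + h(z)^2}{2}\Big] = \mathbf{E}_{\mathcal{N}(m)}[h(y)^2] = \mathbf{E}_{\mathcal{I}(m)}[h(y)^2] = h^T M(\mathcal{I}(m))\, h. \tag{5.4} \]

Then we have that $M(\mathcal{I}(m)) + M(\mathcal{N}(m))$ is a positive matrix. □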
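We have shown that $M(\mathcal{I}_\delta(m))$, $M(\mathcal{I}_\delta(m)) + M(\mathcal{H}_\delta(m))$, and $M(\mathcal{I}_\delta(m)) - M(\mathcal{H}_\delta(m))$ are all positive matrices. By Lemma 5.6.1 this implies that $\bigotimes_{i=1}^{k} M(\mathcal{I}_\delta(d_i)) - \bigotimes_{i=1}^{k} M(\mathcal{H}_\delta(d_i))$ is a positive matrix for any integer $k \ge 1$. Therefore, $\bigotimes_{i=1}^{R_1} M(\mathcal{I}_\delta(d_i)) - \bigotimes_{i=1}^{R_1} M(\mathcal{H}_\delta(d_i))$ is positive. By the property of positive matrices, we have

\[ g^T \Big(\bigotimes_{i=1}^{R_1} M(\mathcal{H}_\delta(d_i))\Big) g \le g^T \Big(\bigotimes_{i=1}^{R_1} M(\mathcal{I}_\delta(d_i))\Big) g. \]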

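Now we calculate $g^T \big(\bigotimes_{i=1}^{R_1} M(\mathcal{I}_\delta(d_i))\big) g$ in terms of $g$'s Fourier coefficients. Notice that $M(\mathcal{I}_\delta(m))$ is also a diagonal matrix: for any odd-cardinality set $S \subseteq [m]$, $M(\mathcal{I}_\delta(m))_{S,S} = \delta$, and for any even-cardinality set $S$, $M(\mathcal{I}_\delta(m))_{S,S} = 1$. Then $\bigotimes_{i=1}^{R_1} M(\mathcal{I}_\delta(d_i))$ is also a diagonal matrix; i.e., for $S, T \subseteq [R_2]$, $S \ne T$, we have

\[ \mathbf{E}_{\prod_{i=1}^{R_1} \mathcal{I}_\delta(d_i)}[\chi_S(y)\chi_T(z)] = 0. \]

Also notice that $g$ only has nonzero Fourier coefficients on odd-cardinality sets, so we can expand $g^T \big(\bigotimes_{i=1}^{R_1} M(\mathcal{I}_\delta(d_i))\big) g$ as:

\[ \sum_{S \subseteq [R_2],\ |S| \text{ is odd}} \hat g(S)^2\, \mathbf{E}_{\prod_{i=1}^{R_1} \mathcal{I}_\delta(d_i)}[\chi_S(y)\chi_S(z)]. \]

The term $\mathbf{E}_{\prod_{i} \mathcal{I}_\delta(d_i)}[\chi_S(y)\chi_S(z)]$ can be further written as:

\[ \prod_{i=1}^{R_1} \mathbf{E}_{\mathcal{I}_\delta(d_i)}\big[\chi_{S \cap \pi_e^{-1}(i)}(y)\,\chi_{S \cap \pi_e^{-1}(i)}(z)\big]. \]

Since $S$ is an odd-cardinality set, there must exist some $i_0$ ($1 \le i_0 \le R_1$) such that the intersection of $S$ and $\pi_e^{-1}(i_0)$ has odd cardinality. Recall that for an odd-cardinality set $S$, $M(\mathcal{I}_\delta(m))_{S,S} = \delta$, and every diagonal entry is at most 1. Therefore

\[ \mathbf{E}_{\prod_{i} \mathcal{I}_\delta(d_i)}[\chi_S(y)\chi_S(z)] = \prod_{i=1}^{R_1} \mathbf{E}_{\mathcal{I}_\delta(d_i)}\big[\chi_{S \cap \pi_e^{-1}(i)}(y)\,\chi_{S \cap \pi_e^{-1}(i)}(z)\big] \le \mathbf{E}_{\mathcal{I}_\delta(d_{i_0})}\big[\chi_{S \cap \pi_e^{-1}(i_0)}(y)\,\chi_{S \cap \pi_e^{-1}(i_0)}(z)\big] = M(\mathcal{I}_\delta(d_{i_0}))_{S \cap \pi_e^{-1}(i_0),\, S \cap \pi_e^{-1}(i_0)} = \delta. \tag{5.5} \]

This implies

\[ \mathbf{E}_{\prod_{i} \mathcal{I}_\delta(d_i)}[g(y)g(z)] \le \delta \sum_{S \subseteq [R_2]} \hat g(S)^2 \le \delta. \]

□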


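Overall, we have proved that

\[ \mathbf{E}_{\mathcal{T}_e}[g(y)g(z)] \le \mathbf{E}_{\prod_{i} \mathcal{I}_\delta(d_i)}[g(y)g(z)] \le \delta, \]

and therefore $\mathbf{E}_{(u,v)\sim P,\ y,z\sim\mathcal{T}_e}[g_v(y)g_v(z)]$ is also bounded by $\delta$.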
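So far, we have shown

\[ \mathbf{E}_{(u,v)\sim P,\ x,y,z\sim\mathcal{T}_e}\Big[\tfrac{1}{8}\big(f_u(x)g_v(y) + g_v(y)g_v(z) + f_u(x)g_v(z)\big)\Big] \le \tfrac{3}{8}\delta. \]

In the next section we analyze:

\[ \mathbf{E}_{(u,v)\sim P,\ x,y,z\sim\mathcal{T}_e}\Big[-\tfrac{3}{8} f_u(x)g_v(y)g_v(z)\Big]. \]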


Analyzing $f_u(x)g_v(y)g_v(z)$

We will first describe some new Fourier analysis tools. Recall the definition of the influence of a coordinate.

Definition 5.3.18. Given a function $f : \{-1,1\}^n \to \mathbb{R}$ and $i \in [n]$, we define the influence of $i$ on $f$ to be

\[ \mathrm{Inf}_i(f) = \sum_{S \ni i} \hat f(S)^2. \]

In this work we also define the influence of a set:

Definition 5.3.19.² For a function $f : \{-1,1\}^n \to \mathbb{R}$ and $T \subseteq [n]$, define

\[ \mathrm{Inf}_T(f) = \sum_{S \supseteq T} \hat f(S)^2. \]
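As a small worked example of these two definitions, the sketch below computes the Fourier coefficients of the majority function on three bits by brute force and evaluates its influences. The function names are illustrative assumptions and coordinates are 0-indexed.

```python
from itertools import combinations, product
from math import prod

def fourier(f, n):
    """Map each subset S (as a tuple) to the Fourier coefficient f_hat(S)."""
    points = list(product([-1, 1], repeat=n))
    return {S: sum(f(x) * prod(x[i] for i in S) for x in points) / len(points)
            for k in range(n + 1) for S in combinations(range(n), k)}

def influence(coeffs, T):
    """Inf_T(f): sum of f_hat(S)^2 over all S that contain T."""
    T = set(T)
    return sum(c * c for S, c in coeffs.items() if T <= set(S))

maj3 = lambda x: 1 if sum(x) > 0 else -1
coeffs = fourier(maj3, 3)

inf_single = influence(coeffs, (0,))    # influence of one coordinate
inf_pair = influence(coeffs, (0, 1))    # influence of a two-element set
inf_empty = influence(coeffs, ())       # equals the sum of all squared coefficients
```

For a single coordinate the set influence coincides with the usual influence of Definition 5.3.18, and `inf_empty` equals $\sum_S \hat f(S)^2 = 1$ for a Boolean function.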

Recall the definition of the Bonami-Beckner noise operator $T_\rho$.

Definition 5.3.20. Let $(\Omega, \mu) = \prod_{i=1}^{n} (\Omega_i, \mu_i)$ be a finite product probability space. For $x = (x_1,\ldots,x_n) \in \Omega$, a random variable $y = (y_1,\ldots,y_n) \in \Omega$ is called $\rho$-correlated with $x$ if each $y_i$ is independently set to be $x_i$ with probability $\rho$ and set to be a random sample from $(\Omega_i, \mu_i)$ with probability $1-\rho$.

The Bonami-Beckner operator $T_\rho$ maps $\{f : \Omega \to \mathbb{R}\}$ to $\{g : \Omega \to \mathbb{R}\}$ and is defined as follows: $g(x) = T_\rho f(x) = \mathbf{E}[f(y)]$, where $y$ is a random variable $\rho$-correlated with $x$.

Remark 5.3.21. In this chapter, we assume $\mu_i$ to be the uniform distribution unless stated otherwise; e.g., when we define the Bonami-Beckner operator on $\{-1,1\}^n$, we refer to the product probability space $(\{-1,1\}, \text{uniform distribution on } \{-1,1\})^n$.

Also, the Bonami-Beckner operator depends on how we write the product space. For example, $\{-1,1\}^{R_2}$ and $\prod_{i=1}^{R_1} \{-1,1\}^{\pi_e^{-1}(i)}$ have different Bonami-Beckner operators. By $\{-1,1\}^{R_2}$, we mean the product of the spaces $\Omega_i = \{-1,1\}$ for $i = 1,2,\ldots,R_2$; by $\prod_{i=1}^{R_1} \{-1,1\}^{\pi_e^{-1}(i)}$ we mean the product of the spaces $\Omega_i = \{-1,1\}^{\pi_e^{-1}(i)}$ for $i = 1,2,\ldots,R_1$. By definition, the Bonami-Beckner operators are different in the above two cases. In this work we mostly use the first operator unless stated otherwise.
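For the bitwise operator on $\{-1,1\}^n$ with uniform coordinates, the definition can be compared against the standard Fourier formula $T_\rho f = \sum_S \rho^{|S|} \hat f(S) \chi_S$, which is the form used later in this section. The following is a minimal sketch of that comparison; the function names and the test function `f` are assumptions made for illustration.

```python
from itertools import combinations, product
from math import prod

def chi(S, x):
    return prod(x[i] for i in S)

def fourier(f, n):
    points = list(product([-1, 1], repeat=n))
    return {S: sum(f(x) * chi(S, x) for x in points) / len(points)
            for k in range(n + 1) for S in combinations(range(n), k)}

def noise_direct(f, n, rho, x):
    """T_rho f(x) = E[f(y)] with y rho-correlated with x, by enumeration."""
    total = 0.0
    for y in product([-1, 1], repeat=n):
        # Each y_i equals x_i with probability rho + (1 - rho)/2 = (1 + rho)/2.
        p = prod((1 + rho) / 2 if yi == xi else (1 - rho) / 2
                 for xi, yi in zip(x, y))
        total += p * f(y)
    return total

def noise_fourier(f, n, rho, x):
    """T_rho f(x) via damping each Fourier coefficient by rho^|S|."""
    return sum(rho ** len(S) * c * chi(S, x) for S, c in fourier(f, n).items())

n, rho = 3, 0.7
f = lambda x: 1 if sum(x) > 0 else -1   # majority, an arbitrary example
max_gap = max(abs(noise_direct(f, n, rho, x) - noise_fourier(f, n, rho, x))
              for x in product([-1, 1], repeat=n))
```

The operator on the blockwise product space mentioned in the remark resamples whole blocks at once, so it would need a different enumeration than the per-bit one sketched here.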

It is a well-known fact that the total influence over all coordinates of a "smoothed function" $T_{1-\gamma}f$ is bounded. We generalize it by proving that for a "smoothed function", the total influence on all constant-size sets is also bounded.

Lemma 5.3.22. For any function $f : \{-1,1\}^n \to [-1,1]$, $\gamma < 1$ and $m \in \mathbb{N}$,

\[ \sum_{|S| \le m} \mathrm{Inf}_S(T_{1-\gamma}f) \le \Big(\frac{m}{2\gamma}\Big)^m. \]

² The definition here is different from [83].

Proof. Notice that every set $S \subseteq [n]$ contains $\sum_{i=0}^{m} \binom{|S|}{i}$ subsets of size at most $m$. We know that $\sum_{i=0}^{m} \binom{|S|}{i} \le (|S|+1)^m$ (imagine selecting $m$ times from the $|S|$ elements, where each time we choose either nothing or one of the $|S|$ elements). Then we have

\[ \sum_{|T| \le m} \mathrm{Inf}_T(T_{1-\gamma}f) \le \sum_{S \subseteq [n]} (|S|+1)^m (1-\gamma)^{2|S|} \hat f(S)^2. \tag{5.6} \]

With the inequality from Lemma 5.3.23, it can be shown that $(|S|+1)^m (1-\gamma)^{2|S|} \le (\frac{m}{2\gamma})^m$, and therefore

\[ (5.6) \le \Big(\frac{m}{2\gamma}\Big)^m \sum_{S \subseteq [n]} \hat f(S)^2 \le \Big(\frac{m}{2\gamma}\Big)^m. \]

□


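Lemma 5.3.23. For $0 < \gamma < 1/2$, $x \ge 0$, $m \in \mathbb{N}$ and $f(x) = (1-\gamma)^{2x}(x+1)^m$, we have $f(x) \le (\frac{m}{2\gamma})^m$.

Proof. Since $1-\gamma \le e^{-\gamma}$, we have $f(x) \le e^{-2x\gamma}(x+1)^m = h(x)$. By some calculus, $h(x)$ reaches its maximum when $x = \frac{m}{2\gamma} - 1$. Then we have $f(x) \le h(\frac{m}{2\gamma} - 1) = e^{2\gamma - m}(\frac{m}{2\gamma})^m \le (\frac{m}{2\gamma})^m$, where the last step uses $2\gamma \le m$. □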
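Both elementary inequalities used here can be checked numerically on a grid. The sketch below is only a sanity check under stated assumptions: it takes the bound in the form $(m/(2\gamma))^m$ with $0 < \gamma < 1/2$, evaluates $f$ at integer points only (the values $x = |S|$ that matter for Lemma 5.3.22), and also tests the counting bound $\sum_{i \le m} \binom{s}{i} \le (s+1)^m$. All names are illustrative.

```python
from math import comb

def f(x, gamma, m):
    """f(x) = (1 - gamma)^(2x) * (x + 1)^m."""
    return (1 - gamma) ** (2 * x) * (x + 1) ** m

def bound(gamma, m):
    """The claimed upper bound (m / (2 * gamma))^m."""
    return (m / (2 * gamma)) ** m

def worst_ratio(gamma, m, x_max=2000):
    """Largest value of f(x) / bound over integer x in [0, x_max]."""
    return max(f(x, gamma, m) / bound(gamma, m) for x in range(x_max + 1))

def small_subsets(s, m):
    """Number of subsets of an s-element set with size at most m."""
    return sum(comb(s, i) for i in range(m + 1))

ratios = {(gamma, m): worst_ratio(gamma, m)
          for gamma in (0.05, 0.1, 0.25, 0.49) for m in (1, 2, 3, 4)}
counting_ok = all(small_subsets(s, m) <= (s + 1) ** m
                  for s in range(13) for m in range(1, 6))
```

Every entry of `ratios` stays at or below 1 on this grid, which is consistent with the lemma but of course does not replace the calculus argument.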

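In this work we extend the hypercontractive inequality into the following form:

Lemma 5.3.24. Let $f : \{-1,1\}^n \to [-1,1]$ and $0 < \gamma < 1$. Then

\[ \|T_{1-\gamma}f\|_3 \le \|f\|_2^{\frac{2+2\gamma}{3}}. \]

Proof.

\[ \|T_{1-\gamma}f\|_3 = \mathbf{E}[|T_{1-\gamma}f(x)|^3]^{1/3} \le \mathbf{E}[|T_{1-\gamma}f(x)|^{2+2\gamma}]^{1/3}. \]

Notice that $(1-\gamma) \le \frac{1}{\sqrt{1+2\gamma}}$, so we can use the hypercontractive inequality and get:

\[ \mathbf{E}[|T_{1-\gamma}f|^{2+2\gamma}]^{1/3} = \|T_{1-\gamma}f\|_{2+2\gamma}^{\frac{2+2\gamma}{3}} \le \|f\|_2^{\frac{2+2\gamma}{3}}. \]

□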
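The inequality of Lemma 5.3.24 can be spot-checked on random bounded functions. The following is a rough numerical sketch, assuming the bitwise noise operator on $\{-1,1\}^n$ and restricting to $\gamma \le 1/2$ so that $2 + 2\gamma \le 3$; the helper names, the dimension $n = 4$, and the random seed are arbitrary choices for illustration.

```python
import random
from itertools import combinations, product
from math import prod

def lp_norm(values, p):
    """L_p norm with respect to the uniform distribution on the domain."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1 / p)

def smooth(table, n, rho):
    """Apply T_rho via the Fourier expansion: f_hat(S) -> rho^|S| * f_hat(S)."""
    points = list(table)
    subs = [S for k in range(n + 1) for S in combinations(range(n), k)]
    coeffs = {S: sum(table[x] * prod(x[i] for i in S) for x in points) / len(points)
              for S in subs}
    return {x: sum(rho ** len(S) * c * prod(x[i] for i in S)
                   for S, c in coeffs.items()) for x in points}

def check(n=4, gammas=(0.1, 0.3, 0.5), trials=20, seed=0):
    """True if ||T_{1-gamma} f||_3 <= ||f||_2^((2+2*gamma)/3) in every trial."""
    rng = random.Random(seed)
    points = list(product([-1, 1], repeat=n))
    for _ in range(trials):
        table = {x: rng.uniform(-1, 1) for x in points}   # f with values in [-1, 1]
        for gamma in gammas:
            lhs = lp_norm(smooth(table, n, 1 - gamma).values(), 3)
            rhs = lp_norm(table.values(), 2) ** ((2 + 2 * gamma) / 3)
            if lhs > rhs + 1e-12:
                return False
    return True

holds = check()
```

A passing run only shows that no counterexample was found among the sampled functions.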


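As a corollary, we have

Corollary 5.3.25.

\[ \|T_{1-\gamma}f\|_3 \le \|T_{1-0.5\gamma}f\|_2^{\frac{2+\gamma}{3}}. \]

Proof. Since $1 - \gamma \le 1 - \gamma + \gamma^2/4 = (1-0.5\gamma)^2$,

\[ \|T_{1-\gamma}f\|_3 \le \|T_{1-\gamma+\gamma^2/4}f\|_3 = \|T_{1-0.5\gamma}(T_{1-0.5\gamma}f)\|_3 \le \|T_{1-0.5\gamma}f\|_2^{\frac{2+\gamma}{3}}, \]

where the last step applies Lemma 5.3.24 to the function $T_{1-0.5\gamma}f$ with $\gamma/2$ in place of $\gamma$. □

Now we are ready to analyze $f_u(x)g_v(y)g_v(z)$. The following is the key theorem we need: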
Theorem 5.3.26. There exist positive constants $\gamma, \tau$ depending only on $d, \delta$ such that for any odd Boolean functions $f : \{-1,1\}^{R_1} \to \{-1,1\}$ and $g : \{-1,1\}^{R_2} \to \{-1,1\}$, if for every $1 \le i \le R_1$ and every odd-cardinality set $S \subseteq \pi_e^{-1}(i)$,

\[ \min\big(\mathrm{Inf}_i(T_{1-0.5\gamma}f),\ \mathrm{Inf}_S(T_{1-0.5\gamma}g)\big) \le \tau, \]

then

\[ \Big|\mathbf{E}_{x,y,z\sim\mathcal{T}_e}[f(x)g(y)g(z)]\Big| \le 3\sqrt{\delta}. \]


Proof. The first idea is that we can apply a smoothing operator to $f$ and $g$ without changing the expectation too much.

Formally, we claim there exist positive constants $\gamma', \gamma$ ($\gamma' > \gamma > 0$) depending only on $\delta$ and $d$ such that

\[ \Big|\mathbf{E}_{x,y,z\sim\mathcal{T}_e}[f(x)g(y)g(z)] - \mathbf{E}_{x,y,z\sim\mathcal{T}_e}[T_{1-\gamma}f(x)\,T_{1-\gamma'}g(y)\,T_{1-\gamma'}g(z)]\Big| \le \sqrt{\delta}. \tag{5.7} \]

The above claim is proved by Lemma 5.4.2.
From Lemma 5.3.12, we know


Ri  R  i  R  i 

P(  n  n^x^;n  -w.))  ^  ^ 

j=l  i= 1  i=l 

Therefore,  notice  that  ’E\Ti-rf{x)\  =  0  and  both  Ti-rf(x)  and  T i_y g{y)T i_y g{z)  are  bounded 
in  [-1, 1]  and  they  have  variance  less  than  1.  Therefore,  we  have 

I  E  [Ti-jfWTi-ygiyWx-ygiz)]  =  |Cov(Ti_r/‘(x),Ti_r/g(y)Ti_r/g(2))|  <  \/d. 

x,y,z~n 

(5.8) 

Following  we  will  prove  the  expectation  of  the  product  of  “smoothed"  function  (Ti_r/')(Ti_r'g)(Ti_r0 

R  R 

is  very  close  on  distribution  3~e  =  11  il^^Csidf)  and  II ^i^sidi).  Formally,  we  will  show: 

I  E  [Ti-Yf  {x)T i-y g{y)T i-y g{z)\  -  E  [T\-yf{x)T1-yg{y)T1-yg(,z)\\<\/8 

nfijWi)  Te=n%^s(di) 

(5.9) 

Combining  (5.7), (5.8), (5.9),  we  then  prove  E X,y,z~$-e[f(x)g(y)g(z)]  <  3V8 
It  remains  to  prove  (5.9).  The  technique  for  the  proof  of  (5.9)  is  similar  to  the  Invari¬ 
ance  Principle  proof  in  [116].  Roughly  speaking,  we  show  every  time  we  can  change  the 
distribution  at  one  coordinate  from  JCgidi)  to  ^s(di)  and  the  change  will  be  bounded  by 
the  influence  at  that  coordinate.  And  then  we  use  the  fact  that  for  “smoothed"  function  the 
total  influence  is  bounded. 

To  clarify,  let  us  start  by  changing  the  distribution  at  the  first  coordinate  from  JCgidi) 
to  ^s(d i).  Without  loss  of  generality,  let  us  assume  that  1  ( 1)  =  {l,2,..di}.  Let  us  write 
x!  for  (x2,..x„)  and  /  for  (yd1+i, ..yRz)  and  z’  for  (zdl+i,.zR2). 

We can think of $f$ as a function only of the variable $x_1$ and write $f$ by its Fourier expansion as:

\[ f(x) = F_\emptyset(x') + x_1 F_{\{1\}}(x'). \]
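Similarly, we can view $g$ as a function only of the variables $y_1,\ldots,y_{d_1}$, and we denote $g$'s Fourier expansion (on $y_1,\ldots,y_{d_1}$) as

\[ g = \sum_{S \subseteq [d_1]} \chi_S(y)\, G_S(y'). \]

For any $S \subseteq [d_1]$, we can also represent $G_S$ by $g$'s original Fourier coefficients $\hat g(Q)$ ($Q \subseteq [R_2]$) as follows:

\[ G_S(y') = \mathbf{E}_{y_1,\ldots,y_{d_1}}[g(y)\chi_S(y)] = \sum_{Q \subseteq [R_2],\ Q \cap [d_1] = S} \hat g(Q)\, \chi_{Q \setminus S}(y'). \]

Therefore $G_S(y') = \mathbf{E}_{y_1,\ldots,y_{d_1}}[g(y)\chi_S(y)]$ is always bounded in $[-1,1]$. Similarly, $F_{\{1\}}(x') = \sum_{Q \ni 1} \hat f(Q)\, \chi_{Q \setminus \{1\}}(x')$, and it is also bounded in $[-1,1]$.

It is easy to see that $T_{1-\gamma}f$ has the Fourier expansion $T_{1-\gamma}F_\emptyset + x_1(1-\gamma)T_{1-\gamma}F_{\{1\}}$ (on the variable $x_1$) and $T_{1-\gamma'}g$ has the Fourier expansion $\sum_{S \subseteq [d_1]} \chi_S (1-\gamma')^{|S|}\, T_{1-\gamma'}G_S$ (on the variables $y_1,\ldots,y_{d_1}$).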

By  Lemma  5.3.6,  if  we  take  conditional  expectation  only  on  (x,yi,..yd1,.zi...zd1),  we  have 

E  [T\-yf(x)Ti-yg{y)T\-yg(z)\  -  E  [Ti_r/Xx)Ti_r/g(ym_r'g(2) 

=  -(1  -  <5)  Ed  -  7)(1  -  r'f^Ti-yFwix'W^yGsiy'm-yGsiz') 
s 

Further  condition  on  x',y',z',  we  can  calculate  the  difference  of  changing  the  first  coor¬ 
dinate  as  follows: 


E 


VT i-yf  (x)T i-yg(y)T i-r'g{z)\  - 


E 

n  H^sGk) 


VT  i-yf  (x)T  i-yg(y)T  i-y'g{z)\ 


=  E 

n  llz&sm 

=  -0.-8) 


E  \Ti-yf(x)T i_yg(y)T i-yg(z )]  -  E  [Ti_r/'(x)T1_r^(y)T1_r/g(2)] 
J?s(d  i)  '  &s(d  i)  '  '  ' 


E  (l-y)(l-y'> 

|S|  is  odd,S£[cfi] 


/\2|S| 


E  [T1.rFll}l(x')T1.yGs(y')T1.yGs(z')]. 


nkl2-xs(dk) 


(5.10) 


Notice  that  flfio  J€s(.di)  has  uniform  marginal  distribution  on  x',y' ,z' .  We  have 


li= 2 


E  [T i-yFiTi-yGs T i_r'Gs]  <  WT^yF^MTx-yGsWl  (Holder’s  Inequality) 


n 


2+y 

“3“ 


4+2/ 


^  IITi-o.syEdillg3  ||Ti-o.5r/Gs||2  3  (Corollary  5.3.25) 
By  representing  Gs  by  g’s  original  Fourier  coefficients,  we  have 

m-0.5yd3sllHlTi_o.5y'  E  ^(Q)7Q\slll=  E  d  ~  0.5r')2|Q|-2|S|g(Q)2 

Q:Qn[di]=S  Q:Qn[di]=S 


And  similarly, 


II T 1-0. 5yE {1}  II 2  ^ 


InfiTi-o^y/" 

(l-0.5y)2 
We can then bound $|(5.10)|$ by


(1-d)  X  (i-y')2|Slu-r) 

|S|  is  odd,iSQ[di] 


l+0.5y 

'Infiri.o.syn  3 

.  (l-0.5y)2 


/'InfgTi-0,5y>g\ 

(l-0.5y')2|S|  J 


2+7' 

3 


1+0.57  2+7' 

-  E  (InfjTi-o.sy/-)  3  (InfgTi_0.5r'g)  3 

|S|  is  odd,Se[di] 


Take  r  to  be  ( 2(rf/o  5y)rf  )r  anc^  no^ce  that  7-7  (which  implies  InfsTi_o.5r'g  >  InfsTi_o.5rg), 
we  have 

min(InfiT'i-0.5r/',InfsTi_o.5r'^)  <  r 

and  therefore 

l+0.5y  2+y'  /a  1  2 

(InfiTi_o.5r/)  3  (InfsTi_0.5T'g)  3  <  ty  (InfiTi-o.5rf)3(InfsTi-o.5yg)3- 

Recall  d\  <  d,  then  we  can  bound  the  difference  of  changing  the  first  coordinate  from 
Jd’sidi)  to  &s{di)  by 

Tr/6  £  (InfiTi_o.5r/‘)^(InfSTi_o.5r^)i  <Tr/6  E  (Inf1Ti_o.5r/’+InfST1_o.5r^) 

|S|  is  odd,Se[di]  |S|  is  odd,S£[e?i] 

<  Tr/6(2d_1InfiTi_o.5r/’  +  £  InfsTi_0.5rg)  (5.11) 

|S|  is  oddjSeid!] 


Similarly  calculation  will  show  that  for  any  i, 

I  E  [Ti_o.5r/(x)T1_0.5r'g(y)T'1_o.5r'g(2)] - 

nU  Wk )  x  n*ii+1  je$(dk) 

E  [T  1-0.5  yf  (x)T i-0.5  yg(y)T  1-0.5  y'g(z)]  \ 

nUi^(d*)xn*i-*Jw*) 

<  Tr/6(2rf_  1inf j  T i-o.5y  /  +  £  InfsTi_0.5rg) 

|S|  is  odd,Se^:_1(i) 

If  we  sum  above  inequality  over  i  from  1  to  22 1,  we  have 

I  E  [T'i_o.5r/’(x)Ti_o.5r'g(y)Ti_o.5r'g(2)]  -  E  [Ti-o,Qrf{x)Ty-05yg(y)Ti-o5r'g(z)]]\ 

n  fJi^aW/)  3-e=n  d^sidi) 

Ri 

<  rr/6£(2d-1Inf,Ti_o.5r/+  E  InfSTi-o.5rS) 

i= 1  SE7r_1(i) 

<  rr/6(2d“1(l/y)  +  (d/y)d)  (By  Lemma  5.3.22) 

<  rr/6(2  (d/y)d)  =  V8. 

This  proves  (5.9).  □ 
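Soundness Proof

Now we prove the following soundness theorem.

Theorem 5.3.27. For any $\varepsilon$, if some proof passes the test with probability more than $5/8 + \varepsilon$, then we have $\mathrm{opt}(\mathcal{L}) > \eta$. Here $\eta > 0$ is some positive constant depending only on $\varepsilon$ and $d$.

Proof. Recall that $\varepsilon = 2\sqrt{\delta}$. Suppose some proof passes the test with probability at least $5/8 + \varepsilon$; then

\[ \mathbf{E}_{e=(u,v)\sim P,\ \mathcal{T}_e}\Big[\tfrac{5}{8} + \tfrac{1}{8}\big(f_u(x) + g_v(y) + g_v(z)\big) + \tfrac{1}{8}\big(f_u(x)g_v(y) + g_v(y)g_v(z) + f_u(x)g_v(z)\big) - \tfrac{3}{8} f_u(x)g_v(y)g_v(z)\Big] \ge \tfrac{5}{8} + 2\sqrt{\delta}. \]

By the oddness of $f_u, g_v$ and Theorems 5.3.13 and 5.3.14, we know

\[ \mathbf{E}_{e=(u,v)\sim P,\ \mathcal{T}_e}\Big[\tfrac{5}{8} + \tfrac{1}{8}\big(f_u(x) + g_v(y) + g_v(z)\big) + \tfrac{1}{8}\big(f_u(x)g_v(y) + g_v(y)g_v(z) + f_u(x)g_v(z)\big)\Big] \le \tfrac{5}{8} + \tfrac{3}{8}\delta \le \tfrac{5}{8} + \tfrac{3}{8}\sqrt{\delta}. \]

Therefore,

\[ \Big|\mathbf{E}_{e=(u,v)\sim P,\ \mathcal{T}_e}[f_u(x)g_v(y)g_v(z)]\Big| \ge \tfrac{13}{3}\sqrt{\delta} > 4\sqrt{\delta}. \]

Then by an averaging argument, for at least a $\sqrt{\delta}$ fraction of the edges $(u,v)$, we have

\[ \Big|\mathbf{E}_{\mathcal{T}_e}[f_u(x)g_v(y)g_v(z)]\Big| > 3\sqrt{\delta}. \]

We call these edges "good". By Theorem 5.3.26, we know that for every "good" edge $(u,v)$, there must exist some $i$ and an odd-cardinality set $S \subseteq \pi_e^{-1}(i)$ such that:

\[ \min\big(\mathrm{Inf}_i(T_{1-0.5\gamma}f_u),\ \mathrm{Inf}_S(T_{1-0.5\gamma}g_v)\big) > \tau. \tag{5.12} \]

We can define the following randomized labeling strategy for $\mathcal{L}$:

For $u \in U$, define

\[ S_u = \{\, i \mid \mathrm{Inf}_i(T_{1-0.5\gamma}f_u) > \tau \,\} \]

and for $v \in V$, define

\[ S_v = \{\, j \mid j \in S,\ \mathrm{Inf}_S(T_{1-0.5\gamma}g_v) > \tau,\ |S| \le d,\ |S| \text{ is odd} \,\}. \]

Given (5.12), for "good" edges $(u,v)$, $S_u$ and $S_v$ must both be nonempty and there exists some $i \in S_u$ such that $\pi_e^{-1}(i) \cap S_v \ne \emptyset$.

Also, by Lemma 5.3.22, we know that $\sum_{S} \mathrm{Inf}_S(T_{1-0.5\gamma}g_v) \le (d/\gamma)^d$, where the sum is over sets of size at most $d$. Therefore, the number of sets $S$ that satisfy $\mathrm{Inf}_S(T_{1-0.5\gamma}g_v) > \tau$ is at most $(d/\gamma)^d/\tau$, and therefore $|S_v| \le d\,(d/\gamma)^d/\tau$. Similarly, we have $|S_u| \le 1/(\gamma\tau)$.

For every vertex $w \in U \cup V$, our labeling strategy is to randomly pick a label from $S_w$ for $w$. We know that every "good" edge is satisfied with probability at least $\frac{1}{|S_u||S_v|}$. Overall, our randomized strategy satisfies at least a

\[ \frac{\sqrt{\delta}}{|S_u||S_v|} \ge \frac{\sqrt{\delta}\,\gamma^{d+1}\tau^2}{d^{d+1}} \]

fraction of all the edges. Notice here that $\eta, \gamma, \tau$ are positive constants depending on $\delta$ and $d$.

□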
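The arithmetic in this proof can be checked exactly. The sketch below assumes the acceptance polynomial has coefficients $5/8$, $1/8$, $1/8$ and $-3/8$ on the constant, linear, quadratic and cubic terms (the reading consistent with a test accepted by a random assignment with probability $5/8$); the function name `acceptance` and the sample value of $\delta$ are illustrative assumptions.

```python
from fractions import Fraction as F
from itertools import product

def acceptance(a, b, c):
    """Assumed acceptance polynomial of the test on values a, b, c in {-1, 1}."""
    return (F(5, 8) + F(1, 8) * (a + b + c)
            + F(1, 8) * (a * b + b * c + a * c) - F(3, 8) * a * b * c)

values = {bits: acceptance(*bits) for bits in product([-1, 1], repeat=3)}
accepting = [bits for bits, v in values.items() if v == 1]
average = sum(values.values()) / len(values)

# Gap left for the cubic term, in units of sqrt(delta):
# (3/8) * |E[fgg]| >= 2 - 3/8, so |E[fgg]| >= (8/3) * (13/8) = 13/3.
cubic_lower_bound = F(8, 3) * (2 - F(3, 8))

# Averaging step: if |E_e| <= 1 and the mean exceeds 4t, then more than a
# t / (1 - 3t) >= t fraction of edges have |E_e| > 3t (checked for one sample t).
t = F(1, 10)          # stands for sqrt(delta); a hypothetical value
min_fraction = (4 * t - 3 * t) / (1 - 3 * t)
```

Under this assumption the polynomial is 0/1-valued, accepts exactly five of the eight assignments, and rejects exactly the assignments with a single $+1$.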




5.4 Noise Operator


For the product space $\{-1,1\}^n$ with the uniform distribution at each coordinate, it is easy to check that the Fourier expansion is the Efron-Stein decomposition; i.e., $f = \sum_{S \subseteq [n]} f_S(x)$ where $f_S(x) = \hat f(S)\chi_S(x)$. It is also easy to check that for the product space $\prod_{i=1}^{R_1} \{-1,1\}^{\pi_e^{-1}(i)}$ with the uniform distribution on each $\{-1,1\}^{\pi_e^{-1}(i)}$, $f(x) = \sum_{S \subseteq [R_1]} f_S(x)$ with $f_S(x) = \sum_{T : \pi_e(T) = S} \hat f(T)\chi_T(x)$ is the Efron-Stein decomposition.

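The following lemma is proved in [115] (Proposition 2.12).

Lemma 5.4.1. Let $(\Omega \times \Theta, \mu) = \prod_{i=1}^{n} (\Omega_i \times \Theta_i, \mu_i)$ be a finite product probability space with $\rho(\Omega_i, \Theta_i; \mu_i) \le \rho_i$. Suppose $f : \Theta \to \mathbb{R}$ has the Efron-Stein decomposition $\sum_{S \subseteq [n]} f_S$ on $\prod_{i=1}^{n} (\Theta_i, \mu_i)$, and let $U$ be the conditional expectation operator of $\mu$ mapping a function $f : \Theta \to \mathbb{R}$ to $g : \Omega \to \mathbb{R}$. Then

\[ \|U f_S\|_2 \le \Big(\prod_{i \in S} \rho_i\Big) \|f_S\|_2. \]

Now we prove that for our distribution $\mathcal{T}_e$, the expectation of $f(x)g(y)g(z)$ is close to that of its smoothed version $T_{1-\gamma'}f(x)\,T_{1-\gamma}g(y)\,T_{1-\gamma}g(z)$ for some small constants $\gamma, \gamma'$.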

Lemma  5.4.2.  Let  f  be  a  function  mapping  from  {-l,l}Bl  — ►  [-1,1]  and  g  be  function 
mapping  from  {-1,  l}^2  — »  [-1,1]  .  For  any  small  constant  f  >  0,  let  po  =  1  -  22d+id2  >  7  = 

^-(1  -  po),  7  =  2~’  we  have: 


E 

x,y,z~2Te 


[Ti_r//,(x)Ti_rg(y)Ti_rgU)] 


E  [f(x)g(y)g(z)\  <  3/3. 

x,y,z~2Te 


Proof.  By  Lemma  5.3.8  ,  we  know  p(SL  x  Z\3~e)  <  1  -  22d+id2  =  Po- 

Let  us  write  the  Efron-Stein  Decomposition  for  g{z)  on  WA^Z1  ,Tle)  as  Lscyjjgste) 
and  we  know  gs(z )  =  Hn(T)=s §(S)xt(z).  We  also  write  (f(x)g(y))’ s  Efron-Stein  Decompo¬ 
sition  on  Wi=i{SPl  x  (3/1  ,STf)  as  Lsc[i?i]-P1s(^,y)-  Let  Usre  be  the  conditional  operator  asso¬ 
ciated  with  correlated  probability  space  x  <3/1  ,\\^f1Zl  ,2Te).  We  denote  I  as  the 

identity  operator.  Then  we  have 


E  [f(x)g(y)g(z)]  -  mf(x)g(y)T1-rg(z)]  =  E  [f(x)g(y)(I  -  7Vr)g(z)] 

=  Y  ElFs(x,y)Ug-e(I  -T1-r)gs(x,y)\  (By  the  orthogonality)  (5.13) 
S<=[iL] 

It  is  easy  to  check  that  for  the  Efron-Stein  decomposition  of  function  (7  -  Ti-r)g,  we  have 

(C7  -Ti-r)g)s  =  (7  -  T1_r)(gs). 


Then  Lemma  5.4.1,  we  have  that 


\\Ug-e(I  ~  Ti-r)g  sh  ^  p{flll(7-Ti_r)^sll2. 

Denote  ne{Q)  as  {i\ne(i)  e  Q}.  As  ne  is  d-to-1  projection,  we  have  |Q|  <  \S\d  .  For  any 
Q  Q  [7?2],  We  know 

11(7 -  T\-j)gsw\  =  £  (l  - (l - yfmfg(Q)2XT  <  (1  - (1  - y)2|S|rf)2ll^sll2- 

QQ[R2\,ne(Q)=S 
Therefore,


||t/5-e(7-T1_r)^s||2<pJflv/l-(l-r)2ISId)||^||2<min(plf',v/l-(l-r)2ISId)||^s||2. 
When  \S\ >  we  know  pjf 1  <  (5.  When  |S|  <  we  have 

l-(l-y)2|S|d  =  l- 
Therefore, 


:  rd-Po)] 

d 


2|S|d 


o  log|6 

<i-a-p3a-po))  losp°  =o(p2i 


min(pLS|,  yj  1  -  (1  -  y)2|S|rf)  <  p 


By  Cauchy-Schwarz,  we  get: 


(5.13)  <  /  £  WFsWlZuUr.V-T^gsUl*  pl  £  \\FsU*  £  UgsUl  *  P  (5-14) 

y  sc[j?ii  y  seti?!]  seti?!] 

If  we  apply  the  same  calculation  above  when  treating  f(x)Ti-rg(z)  as  a  whole  and 
notice  that  p(3d  *  3, <3/ ;3Te)  <  po,  we  would  get: 

|  E[/ (x)T i-rg{y)T i-rg(z)]  -  E [{f{x)g(y)T1-rg{z)} |  <  yS  (5.15) 


It  remains  to  show 

I  E  [f  (x)T i-rg(y)T i-rg(z)]  -  E[T1-r'f(x)T1-rg(y)T1-rg(z)]\  <  /3. 

e  e 

However,  we  can  not  apply  the  same  trick  again  as  that  p(SCl ,(3/1  x  JEl  ,J?s{di))  =  1. 

Recall  the  definition  of  the  Bonami-Beckner  operator,  we  can  rewrite  Eg-e[f(x)Ti-rg(y)Ti-rg(z)\ 
as  ~E.(g-ey\f{x)g{y*)g{z*)\  where  ( 5~e)*  is  the  distribution  as  follows:  first  we  generate  x,y,z 
by  distribution  3~e  and  then  we  independently  reset  each  bits  in  y,z  with  some  indepen¬ 
dent  random  bit  with  probability  y  and  get  y*  ,z* . 

Recall  that  3~e  -  II  where  each  distribution  3Tel  (on  3£l  xf‘x  2l)  is  set  to  be 

J^gicLi).  It  is  easy  to  check  that  (3~e)*  =  II where  (3~e1)*  =  ( di )  is  the  distribution 
such  that  we  first  generate  (.x,y\..ydi,z\..Zdi)  by  J^s(di)  and  then  we  independently  reset 
every  coordinate  yt  and  Zi  to  be  some  random  bit  with  probability  y. 

Now  we  will  show  piSZ1,^1  x  di))  <  (l-y2d/2).  By  the  definition  of  Jd’gidt), 

there  is  probability  ( y)2di  such  that  yi,Zi  are  all  reset.  When  this  happen,  x  is  independent 
with  y,z.  We  call  V  the  event  that  “yi,Zi  are  all  reset".  Then  we  have 

piSC1,^1  xZi-je*5{di))=  sup  (l-y2di)E[/Xx)G(y,z)|V] 

f,G,E[/]=E[G]=0 

E[/-2]=E[G2]=1 

+  y2d'E[f(x)G(y,z)\V]  (5.16) 

Notice  that  event  V  is  independent  with  x,  we  have  E[/‘(x)G(y,2)|V]  =  E[/(x)|V]E[G(y,2)|V]  = 
E[/(x)]E[G(y,z)|V]  =  0. 
Also since


1  =  E [G2]  =  (1  -  y2d'  )E[G2|V]  +  r2d‘  E[G2|V] 
we  have  E[G(y,2)2|V]  <  1/(1  -y2d‘).  Therefore 

p(3£i,&ixZi;^(di))  =  a-r2di)Wnx)G(y,z)\Vl<a-r2di)\/nf2\V]mG(y,z)2\:Vl 

<  (i  -  j2di)yjE[f2va  -  r2do  <y/i-  r2di  *  d  -  r2d/ 2)  (5. 17) 

We  have  shown  p(3Pl  ,(3/1  x  Xl\J€g{dff)  <  (1  —  y2o?/2).  By  applying  Proposition  5.3.11, 

p(  n  sc1,  n  ®ri  x  ^ 1  -  r2d/2. 

i=l  i=l 

Notice  that 

'E[f(x)Ti-rg(y)T1-rg(z)]  -  'E[T1-yf(x)T1-rg(y)Ti-rg(z)] 

3-fj  sre 

=  E  [f(x)g(y*)g(z*)]~  E  [Tl.r,f(x)g(y*)g(z*)-\  =  E  \g{y*)g(z*)(I  -  T^Y)f{x)\  (5.18) 

or*  or*  1  or*  1 

u  e  u  e  u  Q 

Now  we  can  view  g(y*)g(0*)  as  a  whole.  Similar  to  the  proof  of  (5.15)  and  (5.13),  we 
can  bound  the  |(5.18)|  by  f.  Overall,  we  prove  that 

\E[f(x)g(y)g(z)]-E[T1.r(f(x)T1.rg(y)T1.rg(z)]\  <  3/3. 

□ 

The above proof is similar to the setup of Lemma 6.2 in [115]. The main reason we cannot use it directly is that our distribution has $\rho(\mathcal{X}, \mathcal{Y} \times \mathcal{Z}; \mathcal{T}_e) = 1$. In addition, the Bonami-Beckner operator we need to use is different from the one used in that lemma.


5.5  Probability  Space 

Proof  of  Lemma  5.3.8: 

Proof. Let us first prove a graph property of $\mathcal{H}_\delta$.

Lemma 5.5.1. Define a bipartite graph $G(\mathcal{X} \times \mathcal{Y}, \mathcal{Z})$ as follows: if

\[ \Pr_{\mathcal{H}_\delta}[(x, y_1,\ldots,y_d, z_1,\ldots,z_d)] > 0, \]

then $(x, y_1,\ldots,y_d)$ and $(z_1,\ldots,z_d)$ are in $G$ and there is an edge between them. Then $G$ is connected.

Proof. This is a bipartite graph with no isolated nodes, since nodes are included only if they are on some edge. Notice that $\mathcal{H}_\delta = (1-\delta)\mathcal{H} + \delta\mathcal{N}$. By the definition of $\mathcal{N}$, we know $(z_1,\ldots,z_d)$ has an edge with $(x = z_1, y_1 = z_1, y_2 = -x z_2, \ldots, y_d = -x z_d)$. And by the definition of $\mathcal{H}$, $(x = z_1, y_1 = z_1, y_2 = -x z_2, \ldots, y_d = -x z_d)$ has an edge with $(-1, z_2, \ldots, z_d)$. Therefore, $(z_1,\ldots,z_d)$ is connected with $(-1, z_2, \ldots, z_d)$ if they are not the same node. In the same way, $(z_1,\ldots,z_d)$ is connected with the node obtained by setting any one of the $z_i$ to $-1$, namely $(z_1, z_2, \ldots, z_{i-1}, z_i = -1, z_{i+1}, \ldots, z_d)$, if they are not the same node. Notice that $(1,\ldots,1)$ is in $\mathcal{Z}$ and it can reach any node in $\mathcal{Z}$ by setting some coordinates to $-1$. Also, every node in $\mathcal{X} \times \mathcal{Y}$ is connected with some node in $\mathcal{Z}$. The graph is therefore connected. □


Since  2€g  is  connected.  The  smallest  probability  event  in  2€g  is 
Lemma  2.9  in  [115],  we  know:  p(2  x  3/  ,2\,/8€g)  <  1  - 


d2d ‘ 


22d+1d2 ‘ 


By  applying 

a 


Proof  of  Lemma  10.5.1 

Proof.  Notice  that  2g  ,2  and  Jf ’s  marginal  distributions  on  SP  are  all  uniform. 

(SP,&x2;2s)=  sup  E[/G]  =  (l-5)sup(E[/G]  +  5E[/-G])  = 

/\G,E[/]=E[G]=0-%  f,G  & 

E[/-2]=E[G2]=1 

(1  -  5)E[/-]E[G]  +  8 E[/G]  <  0  +  8  I E[\f\2]E[G2].  (5.19) 

Jf  \l  Jf  Jf 

We  know  E ^[/2]  =  E &s[f2]  =  1.  Also  notice  that  1  =  Ejr,[G2]  =  (1-5)E^[G2]  +  5E ^[G2]  > 
SE^IG2].  We  have  E^tG2]  <  1/8  and  therefore  we  can  bound  (5.19)  by  VS.  □ 


Proof  of  Lemma  5.3.9 

Proof.  We  know  2  /V  are  independent  in  J€.  Also  by  definition  J€g  =  {l-  8)JP  +  8Jf . 
Notice  that  the  marginal  distributions  of  both  JPg  and  .2  on  rW  and  2  are  the  same 
(uniform  distribution),  we  have 

p(2,<3f;JPg)  =  sup  E  [/(x)g(y)]  =  (1  -  d)sup(E[/(x)g(y)]  +  8 E[/(x)g(y)]) 
/•^,E[/]=E[g]=0^5  f,g  Je  -88 

E[/-2]=E[^2]=1 

=  8E[fg]<sjE\f2]E\g2]  =  8. 

J8  Y  -88-88 

□ 


5.6  Matrix  Theory 


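Lemma 5.6.1. Let $A_i$ and $B_i$ be $m_i \times m_i$ matrices such that $A_i$, $A_i + B_i$ and $A_i - B_i$ are positive matrices. Then for any $n$, $\bigotimes_{i=1}^{n} A_i - \bigotimes_{i=1}^{n} B_i$ and $\bigotimes_{i=1}^{n} A_i + \bigotimes_{i=1}^{n} B_i$ are both positive matrices.


Proof. We prove the claim by induction on $n$.

Base: The base case $n = 1$ holds by assumption.

Induction step: Suppose the claim holds for $n = k$. For $n = k+1$, we know

\[ 2\Big(\bigotimes_{i=1}^{k+1} A_i - \bigotimes_{i=1}^{k+1} B_i\Big) = \Big(\bigotimes_{i=1}^{k} A_i - \bigotimes_{i=1}^{k} B_i\Big) \otimes (A_{k+1} + B_{k+1}) + \Big(\bigotimes_{i=1}^{k} A_i + \bigotimes_{i=1}^{k} B_i\Big) \otimes (A_{k+1} - B_{k+1}). \]

By induction, $(A_{k+1} + B_{k+1})$, $(\bigotimes_{i=1}^{k} A_i - \bigotimes_{i=1}^{k} B_i)$, $(A_{k+1} - B_{k+1})$ and $(\bigotimes_{i=1}^{k} A_i + \bigotimes_{i=1}^{k} B_i)$ are all positive matrices. Since the tensor product of two positive matrices is positive, $\bigotimes_{i=1}^{k+1} A_i - \bigotimes_{i=1}^{k+1} B_i$ is positive.

By a similar argument, since we know

\[ 2\Big(\bigotimes_{i=1}^{k+1} A_i + \bigotimes_{i=1}^{k+1} B_i\Big) = \Big(\bigotimes_{i=1}^{k} A_i + \bigotimes_{i=1}^{k} B_i\Big) \otimes (A_{k+1} + B_{k+1}) + \Big(\bigotimes_{i=1}^{k} A_i - \bigotimes_{i=1}^{k} B_i\Big) \otimes (A_{k+1} - B_{k+1}), \]

we have that $\bigotimes_{i=1}^{k+1} A_i + \bigotimes_{i=1}^{k+1} B_i$ is also positive. □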
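The lemma can be spot-checked on random instances. The sketch below makes two simplifying assumptions that are not in the lemma: it only samples real symmetric matrices, and it only tests the strictly positive definite case (so that a plain Cholesky factorization can serve as the test). Matrices are built as $A = (P+Q)/2$, $B = (P-Q)/2$ with $P, Q$ positive definite, so that $A + B = P$ and $A - B = Q$. All names are illustrative.

```python
import random

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def combine(A, B, s):
    """Entrywise A + s * B."""
    return [[a + s * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def is_positive_definite(M, tol=1e-9):
    """Cholesky test for strict positive definiteness of a symmetric matrix."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False
                L[i][i] = s ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return True

def random_pd(n, rng):
    """Random symmetric positive definite matrix G G^T + I."""
    G = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    return [[sum(G[i][k] * G[j][k] for k in range(n)) + (i == j)
             for j in range(n)] for i in range(n)]

def tensor(mats):
    out = [[1.0]]
    for M in mats:
        out = kron(out, M)
    return out

def check(sizes=(2, 2, 3), trials=10, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        As, Bs = [], []
        for n in sizes:
            P, Q = random_pd(n, rng), random_pd(n, rng)   # P = A + B, Q = A - B
            As.append([[0.5 * v for v in row] for row in combine(P, Q, 1.0)])
            Bs.append([[0.5 * v for v in row] for row in combine(P, Q, -1.0)])
        TA, TB = tensor(As), tensor(Bs)
        if not (is_positive_definite(combine(TA, TB, -1.0))
                and is_positive_definite(combine(TA, TB, 1.0))):
            return False
    return True

lemma_holds = check()
```

The semidefinite boundary case, which is the one actually needed for the diagonal matrices $M(\mathcal{H}(m))$ with zero entries, is covered by the induction in the proof rather than by this test.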
Chapter 6

SDP gaps for variants of Label Cover

6.1 Introduction


In this chapter, we mainly study the SDP gap for 2-to-1 Label-Cover (as well as some other variants).

6.1.1 Motivations

The main reason to study d-to-1 Label-Cover is to understand the approximability of satisfiable instances. For example, in Chapter 5, we study an application of the d-to-1 conjecture to satisfiable 3-CSPs. Another hardness result for satisfiable instances concerns the graph coloring problem: an important result is due to Dinur, Mossel, and Regev [43], who used the "2-to-1 Conjecture" as well as the "2-to-2 Conjecture" and the "α-Constraint Conjecture". (These conjectures will be described formally in Section 6.3.) An instance of Label-Cover with α-constraints was also implicit in the result of Dinur and Safra [44] on the hardness of approximating minimum vertex cover.


6.1.2 Statements of the Conjectures

We have already defined d-to-1 Label-Cover in Section 2.5. 2-to-1 Label-Cover is the special case of $d = 2$. We restate its definition here:

Definition 1. A mapping $\pi : [R] \to [R]$ is said to be 2-to-1 if for each element $j \in [R]$ we have $|\pi^{-1}(j)| \le 2$. A Label-Cover instance is said to be 2-to-1 if all its constraints are 2-to-1 projections.
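The definition translates directly into a check on preimage sizes. The following is a small illustrative sketch; the representation of a projection as a Python list `pi`, where `pi[i]` is the image of label `i`, and the function name `is_d_to_1` are assumptions made for the example.

```python
from collections import Counter

def is_d_to_1(pi, d):
    """True if every label has at most d preimages under the map i -> pi[i]."""
    return max(Counter(pi).values(), default=0) <= d

# Labels 0..3 mapped into [4]; each image has at most two preimages.
two_to_one = [0, 0, 1, 1]
# Label 0 has three preimages, so this map is 3-to-1 but not 2-to-1.
three_to_one = [0, 0, 0, 1]
```

A Label-Cover instance would then be 2-to-1 when `is_d_to_1(pi, 2)` holds for the projection on every edge.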

Conjecture 1. [97] (2-to-1 Conjecture) For any $\delta > 0$, for 2-to-1 Label-Cover with alphabet size large enough (while still being a constant depending only on $\delta$), it is NP-hard to $(1, \delta)$-approximate the problem.

6.1.3 Evidence for and against

Despite significant work, the status of the UGC, as well as the 2-to-1, 2-to-2, and α-Constraint Conjectures, is unresolved. Towards disproving the conjectures, the best algorithms known are due to Charikar, Makarychev, and Makarychev [28]. Using somewhat strong SDP relaxations, those authors gave polynomial-time SDP-rounding algorithms which achieve:

• Value $K^{-\varepsilon/(2-\varepsilon)}$ (roughly) for Unique Label-Cover instances with SDP value $1 - \varepsilon$ over alphabets of size $K$.

• Value $K^{-3+2\sqrt{2}-\varepsilon}$ for 2-to-1 Label-Cover instances with SDP value $1 - O(\varepsilon)$ over alphabets of size $K$.

The best evidence in favor of the UGC is probably the existence of strong SDP gaps. The first such gap was given by Khot and Vishnoi [107]: they constructed a family of Unique Label-Cover instances over alphabet size $K$ with SDP value $1 - \varepsilon$ and integral optimal value $K^{-\Omega(\varepsilon)}$. In addition to roughly matching the CMM algorithm, the Khot-Vishnoi gaps have the nice property that they even hold with Triangle Inequality constraints added into the SDP. Even stronger SDP gaps for UGC were obtained recently by Raghavendra and Steurer [127].

Standing in stark contrast to this is the situation for the 2-to-1 Conjecture and related variants with perfect completeness. Prior to this work, there were no known SDP gap families for these problems with SDP value 1 and integral optimal value tending to 0 with the alphabet size. Indeed, there was hardly any evidence for these conjectures, beyond the fact that the algorithm in [28] failed to disprove them.

6.1.4  SDP  gaps  as  a  reduction  tool 

In addition to being the only real evidence towards the validity of the UGC, SDP gaps
for Unique-Games have served another important role: they are starting points
for strong SDP gaps for other important optimization problems. A notable example of
this comes in the work of Khot and Vishnoi [107], who used the UG gap instance to construct
a super-constant integrality gap for the Sparsest Cut SDP with triangle inequalities,
thereby refuting the Goemans-Linial conjecture that the gap was bounded by $O(1)$.
They also used this approach to show that the integrality gap of the Max-Cut SDP remains
0.878 when triangle inequalities are added. Indeed, the approach via Unique-Games remains
the only known way to get such strong gaps for Max-Cut. Recently, even stronger
gaps for Max-Cut were shown using this framework in [96, 127]. Another example of a
basic problem for which an SDP gap construction is only known via the reduction from
Unique-Games is Maximum Acyclic Subgraph [67].

In view of these results, it is fair to say that SDP gaps for Unique-Games are significant
unconditionally, regardless of the truth of the UGC. Given the importance of the 2-to-1
and related conjectures in reductions to satisfiable CSPs and other problems like coloring
where perfect completeness is crucial, SDP gaps for 2-to-1 Label-Cover and its variants are
worthy of study even beyond the motivation of garnering evidence towards the associated
conjectures on their inapproximability.


6.2  Our  Results 

Label-Cover admits a natural SDP relaxation (see Figure 6.1). In this work, we show
the following results on the limitations of the basic SDP relaxation for Label-Cover
instances with 2-to-1, 2-to-2, and α constraints:

• There is an instance of 2-to-2 Label-Cover with alphabet size $K$ and optimum
value $O(1/\log K)$ on which the SDP has value 1.

• There are instances of 2-to-1 and α-constraint Label-Cover with alphabet size $K$
and optimum value $O(1/\sqrt{\log K})$ on which the SDP has value 1.

In both cases the instances have size $2^{\Omega(K)}$.

We note that if we only require the SDP value to be $1 - \epsilon$ instead of 1, then integrality
gaps for all these problems easily follow from the gaps for Unique-Games constructed by
Khot and Vishnoi [107] (by duplicating labels appropriately to modify the constraints).
However, the motivation behind these conjectures is applications where it is important
that the completeness is 1. Another difference between 2-to-1 Label-Cover and
Unique Label-Cover is that for 2-to-1 instances, it is consistent with the known
algorithmic results of [28] that OPT be as low as $K^{-c}$ for some $c > 0$ independent of $\epsilon$, when
the SDP value is $1 - \epsilon$. It is an interesting question whether OPT can indeed be this low even
when the SDP value is 1. Our constructions do not address this question, as we only show
$\mathrm{OPT} = O(1/\sqrt{\log K})$.

We also point out that our integrality gaps are for special cases of the Label-Cover
problem where the constraints can be expressed as difference equations over $\mathbb{F}_2$-vector
spaces. For example, for 2-to-2 Label-Cover, each constraint $\varphi_e$ is of the form
$x - y \in \{\alpha, \alpha + \gamma\}$ where $\alpha, \gamma \in \mathbb{F}_2^k$ are constants. However, treating the coordinates
$(x_1, \ldots, x_k)$ and $(y_1, \ldots, y_k)$ as separate Boolean variables, and introducing an auxiliary
Boolean variable $z_e$ for the constraint, we can rewrite it as a conjunction of linear equations over $\mathbb{F}_2$:

\[ \bigwedge_{i=1}^{k} \left( x_i - y_i - z_e \cdot \gamma_i = \alpha_i \right). \]

Here $x_i, y_i, \alpha_i, \gamma_i$ denote the $i$th coordinates of the corresponding vectors. Then the problem
of deciding whether the instance is completely satisfiable (OPT = 1) or not (OPT < 1)
reduces to deciding whether the system of linear equations above is satisfiable. This
can be easily done in polynomial time.
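
The polynomial-time test mentioned above is ordinary Gaussian elimination over $\mathbb{F}_2$. The following is a minimal sketch, not taken from the source: the function names `solve_gf2` and `constraint_rows`, the bitmask encoding of rows, and the variable numbering are all illustrative assumptions.

```python
def constraint_rows(x_vars, y_vars, z_var, alpha, gamma):
    """Rows encoding x - y in {alpha, alpha + gamma} as x_i + y_i + gamma_i * z = alpha_i.

    x_vars, y_vars: variable indices of the coordinates of x and y (hypothetical numbering).
    z_var: index of the auxiliary Boolean variable for this constraint.
    Each row is (mask, rhs), where bit v of mask is the coefficient of variable v.
    """
    rows = []
    for xi, yi, ai, gi in zip(x_vars, y_vars, alpha, gamma):
        mask = (1 << xi) ^ (1 << yi)
        if gi:
            mask ^= 1 << z_var
        rows.append((mask, ai))
    return rows


def solve_gf2(rows, num_vars):
    """Return one solution of the F_2-linear system as a list of bits, or None if unsatisfiable."""
    pivots = {}  # pivot column -> (mask, rhs); rows are kept fully reduced
    for mask, rhs in rows:
        # Reduce the incoming row by the existing pivot rows.
        for col, (pmask, prhs) in pivots.items():
            if (mask >> col) & 1:
                mask ^= pmask
                rhs ^= prhs
        if mask == 0:
            if rhs:  # the row reads 0 = 1
                return None
            continue
        col = mask.bit_length() - 1
        # Clear the new pivot column from the older pivot rows.
        for c in list(pivots):
            pmask, prhs = pivots[c]
            if (pmask >> col) & 1:
                pivots[c] = (pmask ^ mask, prhs ^ rhs)
        pivots[col] = (mask, rhs)
    # Free variables are set to 0, so each pivot variable equals its right-hand side.
    solution = [0] * num_vars
    for col, (_, rhs) in pivots.items():
        solution[col] = rhs
    return solution
```

With $k = 2$, variables 0,1 for $x$, 2,3 for $y$ and 4 for $z_e$, the call `solve_gf2(constraint_rows([0, 1], [2, 3], 4, [1, 0], [0, 1]), 5)` returns a satisfying assignment; adding a contradictory row makes it return `None`.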

Despite  this  tractability,  the  SDPs  fail  badly  to  decide  satisfiability.  This  situation  is 
similar  to  the  very  strong  SDP  gaps  known  for  problems  such  as  3-XOR  (see  [131],  [139]), 
for  which  deciding  complete  satisfiability  is  easy. 


6.3  Preliminaries  and  Notation 

6.3.1 2-to-1, 2-to-2 and α Label-Cover Problems

Recall that a Label-Cover instance $\mathcal{L}$ is defined by a tuple $(U, V, E, P, R_1, R_2, \Pi)$. Here
$U$ and $V$ are the two vertex sets of a bipartite graph and $E$ is the set of edges between $U$
and $V$. $P$ is an explicitly given probability distribution on $E$. $R_1$ and $R_2$ are integers with
$1 \le R_1 \le R_2$. $\Pi$ is a collection of "projections", one for each edge:
$\Pi = \{\pi_e : [R_2] \to [R_1] \mid e \in E\}$.

Here an edge $(u,v)$ is satisfied by an assignment $L$ if $L(u) = \pi_{u,v}(L(v))$. The constraint
on each edge is a projection. For 2-to-2 and α Label-Cover, the constraint on each
edge is instead a 2-to-2 constraint or an α-constraint, defined as follows.

Definition 2. A constraint $\pi \subseteq \{1, \ldots, 2R\}^2$ is said to be a 2-to-2 constraint if there are two
permutations $\sigma_1, \sigma_2 : \{1, \ldots, 2R\} \to \{1, \ldots, 2R\}$ such that $(i,j) \in \pi$ if and only if
$(\sigma_1(i), \sigma_2(j)) \in T$, where

\[ T := \{(2l-1, 2l-1),\ (2l-1, 2l),\ (2l, 2l-1),\ (2l, 2l)\}_{l=1}^{R}. \]

A Label-Cover instance is said to be 2-to-2 if all its constraints are 2-to-2 constraints.

A constraint $\pi \subseteq \{1, \ldots, 2R\}^2$ is said to be an α-constraint if there are two permutations
$\sigma_1, \sigma_2 : \{1, \ldots, 2R\} \to \{1, \ldots, 2R\}$ such that $(i,j) \in \pi$ if and only if
$(\sigma_1(i), \sigma_2(j)) \in T'$, where

\[ T' := \{(2l-1, 2l-1),\ (2l-1, 2l),\ (2l, 2l-1)\}_{l=1}^{R}. \]

A Label-Cover instance is said to be α if all its constraints are α-constraints.
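
To make Definition 2 concrete, the sketch below builds the canonical relations $T$ and $T'$ and tests, by brute force over all pairs of permutations, whether a given relation is a 2-to-2 or an α-constraint. The helper names `template` and `matches_template` are illustrative assumptions, and the search is only practical for very small $R$.

```python
from itertools import permutations


def template(R, alpha=False):
    """Canonical relation T on {1..2R}^2, or T' (one pair fewer per block) if alpha=True."""
    pairs = set()
    for l in range(1, R + 1):
        pairs |= {(2 * l - 1, 2 * l - 1), (2 * l - 1, 2 * l), (2 * l, 2 * l - 1)}
        if not alpha:
            pairs.add((2 * l, 2 * l))
    return pairs


def matches_template(pi, R, alpha=False):
    """True if some permutations sigma1, sigma2 map the relation pi exactly onto the template."""
    labels = range(1, 2 * R + 1)
    target = template(R, alpha)
    for s1 in permutations(labels):
        for s2 in permutations(labels):
            # s1[i - 1] plays the role of sigma1(i); likewise for s2.
            if {(s1[i - 1], s2[j - 1]) for (i, j) in pi} == target:
                return True
    return False


# A relation made of the blocks {1,3} x {3,4} and {2,4} x {1,2}.
example = {(1, 3), (1, 4), (3, 3), (3, 4), (2, 1), (2, 2), (4, 1), (4, 2)}
```

Because $\sigma_1$ and $\sigma_2$ are bijections, comparing the image of `pi` with the template is the same as the "if and only if" condition in the definition.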


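\[
\begin{array}{lll}
\text{maximize} & \mathbb{E}_{e=(u,v) \in E} \sum_{(i,j) \in \pi_e} \langle z_{(u,i)}, z_{(v,j)} \rangle & \\
\text{subject to} & \sum_{i \in [R]} \| z_{(v,i)} \|^2 = 1 & \forall\, v \in V \\
& \langle z_{(v,i)}, z_{(v,j)} \rangle = 0 & \forall\, i \ne j \in [R],\ v \in V
\end{array}
\]

Figure 6.1: SDP for Label-Cover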


Conjecture 2. [43] (2-to-2 Conjecture) For any $\delta > 0$, it is NP-hard to decide whether a
2-to-2 Label-Cover instance $\mathcal{L}$ has:

• $\mathrm{OPT}(\mathcal{L}) = 1$

• $\mathrm{OPT}(\mathcal{L}) \le \delta$

It was shown in [43] that the 2-to-2 Conjecture is no stronger than the 2-to-1 Conjecture.

Conjecture 3. [43] (α Conjecture) For any $\delta > 0$, it is NP-hard to decide whether an α
Label-Cover instance $\mathcal{L}$ has:

• $\mathrm{OPT}(\mathcal{L}) = 1$

• $\mathrm{OPT}(\mathcal{L}) \le \delta$

By abuse of notation, for 2-to-2 or α Label-Cover, we use $\pi_{u,v}$ to denote the
constraint on an edge $(u,v)$, and an edge is satisfied if $(L(u), L(v)) \in \pi_{u,v}$. For the case of
2-to-1, where each $\pi_{u,v}$ is a projection, we use $(L(u), L(v)) \in \pi_{u,v}$ as an equivalent
statement of $\pi_{u,v}(L(v)) = L(u)$.

In Figure 6.1, we write down a natural SDP relaxation for the Label-Cover problem.
The relaxation is over the vector variables $z_{(v,i)}$ for every vertex $v \in V$ and label $i \in [R]$.

6.3.2  Fourier  Analysis 

We will use Fourier analysis on $\mathbb{F}_2^k$, where $\mathbb{F}_2 = \{0,1\}$.¹

Let $\mathcal{V} := \{f : \mathbb{F}_2^k \to \mathbb{R}\}$ denote the vector space of all real functions on $\mathbb{F}_2^k$, where addition
is defined as point-wise addition. We always think of $\mathbb{F}_2^k$ as a probability space
under the uniform distribution, and therefore use notation such as
$\|f\|_p := \left(\mathbb{E}_{x \in \mathbb{F}_2^k}[|f(x)|^p]\right)^{1/p}$.
For $f, g \in \mathcal{V}$, we also define the inner product $\langle f, g \rangle := \mathbb{E}_{x \in \mathbb{F}_2^k}[f(x) g(x)]$.

For any $\alpha \in \mathbb{F}_2^k$ define the character $\chi_\alpha \in \mathcal{V}$ as $\chi_\alpha(x) := (-1)^{\alpha \cdot x}$ for all $x \in \mathbb{F}_2^k$. The Fourier characters form an
orthonormal basis for $\mathcal{V}$ with respect to the above inner product, hence every function $f \in \mathcal{V}$
has a unique representation as $f = \sum_{\alpha \in \mathbb{F}_2^k} \hat{f}(\alpha) \chi_\alpha$, where the Fourier coefficient
$\hat{f}(\alpha) := \langle f, \chi_\alpha \rangle$.

We also sometimes identify each $\alpha$ with the set $S_\alpha = \{i \mid \alpha_i = 1\}$ and denote the Fourier
coefficients as $\hat{f}(S)$. We use the notation $|\alpha|$ for $|S_\alpha|$, the number of coordinates where $\alpha$
is 1.

¹ The reader may notice that in other chapters we use harmonic analysis of functions $\{-1,1\}^n \to \mathbb{R}$. The
switch (to $\mathbb{F}_2$) is mainly for notational convenience in this chapter.
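
As a small illustration of these definitions, the sketch below computes all Fourier coefficients of a function on $\mathbb{F}_2^k$ directly from the inner product under the uniform distribution. The function names are illustrative assumptions, and the direct $O(4^k)$ summation is chosen for clarity rather than speed.

```python
from itertools import product


def character(alpha, x):
    """chi_alpha(x) = (-1)^(alpha . x) for bit tuples alpha and x."""
    return -1 if sum(a & b for a, b in zip(alpha, x)) % 2 else 1


def fourier_coefficients(f, k):
    """Map each alpha in F_2^k to <f, chi_alpha> = E_x[f(x) chi_alpha(x)]."""
    points = list(product((0, 1), repeat=k))
    return {
        alpha: sum(f(x) * character(alpha, x) for x in points) / len(points)
        for alpha in points
    }


def indicator_all_ones(x):
    """Example function: 1 on the all-ones point, 0 elsewhere."""
    return 1 if all(x) else 0


coeffs = fourier_coefficients(indicator_all_ones, 3)
```

For this indicator every coefficient has absolute value $1/8$, so the squared coefficients sum to $1/8 = \mathbb{E}[f^2]$, as Parseval's identity requires.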

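We need the following result due to Talagrand ("Proposition 2.3" in [135]), proven using
hypercontractivity methods:

Theorem 6.3.1. Suppose $F : \mathbb{F}_2^k \to \mathbb{R}$ has $\mathbb{E}[F] = 0$. Then

\[ \sum_{\alpha \in \mathbb{F}_2^k \setminus \{0\}} \hat{F}(\alpha)^2 / |\alpha| = O\!\left( \frac{\|F\|_2^2}{\ln\left( \|F\|_2 / (e \|F\|_1) \right)} \right). \]

More precisely, we will need the following easy corollary:

Corollary 6.3.2. If $F : \mathbb{F}_2^k \to \{0,1\}$ has mean $1/K$, then

\[ \hat{F}(0)^2 + \sum_{\alpha \in \mathbb{F}_2^k \setminus \{0\}} \hat{F}(\alpha)^2 / |\alpha| = O(1/(K \log K)). \]

Proof. We have $\hat{F}(0)^2 = \mathbb{E}[F]^2 = 1/K^2 \le O(1/(K \log K))$, so we can disregard this term. As
for the sum, we apply Theorem 6.3.1 to the function $F' = F - 1/K$, which has mean 0 as
required for the theorem and the same Fourier coefficients as $F$ at every $\alpha \ne 0$. It is easy to
calculate that $\|F'\|_2^2 = O(1/K)$ and $\|F'\|_1 = O(1/K)$,
and so the result follows. □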
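
The norm calculation in this proof can be written out explicitly. The following worked computation is a sketch under the proof's own hypotheses ($F$ is $\{0,1\}$-valued with mean $1/K$ and $F' = F - 1/K$); the explicit constants are derived here and are not quoted from the source.

```latex
% F' equals 1 - 1/K with probability 1/K and -1/K with probability 1 - 1/K.
\begin{aligned}
\|F'\|_2^2 &= \tfrac{1}{K}\bigl(1 - \tfrac{1}{K}\bigr)^2 + \bigl(1 - \tfrac{1}{K}\bigr)\tfrac{1}{K^2}
            = \tfrac{1}{K}\bigl(1 - \tfrac{1}{K}\bigr), \\
\|F'\|_1   &= \tfrac{1}{K}\bigl(1 - \tfrac{1}{K}\bigr) + \bigl(1 - \tfrac{1}{K}\bigr)\tfrac{1}{K}
            = \tfrac{2}{K}\bigl(1 - \tfrac{1}{K}\bigr), \\
\frac{\|F'\|_2}{e\,\|F'\|_1} &= \frac{\sqrt{K}}{2e\sqrt{1 - 1/K}} \;\ge\; \frac{\sqrt{K}}{2e},
\qquad\text{so}\qquad
\ln\frac{\|F'\|_2}{e\,\|F'\|_1} \;\ge\; \tfrac{1}{2}\ln K - \ln(2e) = \Omega(\log K).
\end{aligned}
```

Substituting into Theorem 6.3.1 gives a bound of $O\bigl((1/K)/\log K\bigr)$ for the sum once $K$ is large enough that the logarithm is positive.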


6.4  Integrality  Gap  for  2-to-2  Games 

We first give an integrality gap for Label-Cover with 2-to-2 constraints. The instance
for 2-to-1 Label-Cover will be an extension of the one below. In fact, our analysis of
OPT in the 2-to-1 case will follow simply by reducing it to the analysis of OPT for the 2-to-2
instance below.

The vertex set $V$ in our instance is the same as the vertex set of the Unique-Games
integrality gap instance constructed in [107]. Let $\mathcal{F} := \{f : \mathbb{F}_2^k \to \mathbb{F}_2\}$ denote the family of
all Boolean functions on $\mathbb{F}_2^k$. For $f, g \in \mathcal{F}$, define the product $fg$ as $(fg)(x) := f(x) g(x)$.
Consider the equivalence relation $\sim$ on $\mathcal{F}$ defined as $f \sim g \Leftrightarrow \exists \alpha \in \mathbb{F}_2^k$ s.t. $f = g \chi_\alpha$. This
relation partitions $\mathcal{F}$ into equivalence classes $\mathcal{P}_1, \ldots, \mathcal{P}_n$, with $n := 2^K / K$. The vertex set
$V$ consists of the equivalence classes $\{\mathcal{P}_i\}_{i \in [n]}$. We denote by $[\mathcal{P}_i]$ the lexicographically
smallest function in the class $\mathcal{P}_i$ and by $\mathcal{P}_f$ the class containing $f$.

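We take the label set to be of size $K$ and identify $[K]$ with $\mathbb{F}_2^k$ in the obvious way. For
each tuple of the form $(\gamma, f, g)$ where $\gamma \in \mathbb{F}_2^k \setminus \{0\}$ and $f, g \in \mathcal{F}$ are such that
$(1 + \chi_\gamma) f = (1 + \chi_\gamma) g$, we add a constraint $\pi_{(\gamma,f,g)}$ between the vertices $\mathcal{P}_f$ and $\mathcal{P}_g$. Note that the
condition on $f$ and $g$ is equivalent to saying that $\chi_\gamma(x) = 1 \Rightarrow f(x) = g(x)$. If $f = [\mathcal{P}_f] \chi_\alpha$
and $g = [\mathcal{P}_g] \chi_\beta$ and if $A : [n] \to \mathbb{F}_2^k$ denotes the labeling, the relation $\pi_{(\gamma,f,g)}$ is defined as

\[ (A(\mathcal{P}_f), A(\mathcal{P}_g)) \in \pi_{(\gamma,f,g)} \ \Leftrightarrow\ (A(\mathcal{P}_f) + \alpha) - (A(\mathcal{P}_g) + \beta) \in \{0, \gamma\}. \]

Note that for any $\omega \in \mathbb{F}_2^k$, the constraint maps the labels $\{\omega, \omega + \gamma\}$ for $\mathcal{P}_f$ to the labels
$\{\omega + \alpha - \beta,\ \omega + \alpha - \beta + \gamma\}$ for $\mathcal{P}_g$ in a 2-to-2 fashion. We denote the set of all constraints by $\Pi$.
We remark that, as in [107], our integrality gap instances contain multiple constraints on
each pair of vertices.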
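
The construction can be enumerated explicitly for tiny parameters. The sketch below does so for $k = 2$ (so $K = 4$), under two assumptions that are made only for this illustration: Boolean values are represented as $\pm 1$ so that products with characters make sense, and labels are integers $0, \ldots, K-1$ read as bit vectors with XOR as addition. All function names are hypothetical.

```python
from itertools import product

k = 2
K = 2 ** k  # alphabet size
functions = list(product((-1, 1), repeat=K))  # f stored as (f(0), ..., f(K-1))


def chi(alpha):
    """Character chi_alpha as a tuple of +-1 values indexed by x = 0..K-1."""
    return tuple(-1 if bin(alpha & x).count("1") % 2 else 1 for x in range(K))


def mul(f, g):
    """Pointwise product (fg)(x) = f(x) g(x)."""
    return tuple(a * b for a, b in zip(f, g))


def rep_and_shift(f):
    """Return ([P_f], alpha) with f = [P_f] * chi_alpha and [P_f] lexicographically smallest."""
    return min((mul(f, chi(alpha)), alpha) for alpha in range(K))


def constraints():
    """Yield every tuple (gamma, f, g) with (1 + chi_gamma) f = (1 + chi_gamma) g."""
    for gamma in range(1, K):
        proj = tuple(1 + c for c in chi(gamma))
        for f in functions:
            for g in functions:
                if mul(proj, f) == mul(proj, g):
                    yield gamma, f, g


def relation(gamma, f, g):
    """Label pairs (a, b) for (P_f, P_g) allowed by the constraint indexed by (gamma, f, g)."""
    _, alpha = rep_and_shift(f)
    _, beta = rep_and_shift(g)
    return {(a, b) for a in range(K) for b in range(K)
            if (a ^ alpha) ^ (b ^ beta) in (0, gamma)}
```

For these parameters there are $2^K / K = 4$ vertices and $3 \cdot 16 \cdot 4 = 192$ constraint tuples, and in each relation every label on either side has exactly two partners.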

6.4.1 SDP Solution


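We give below a set of feasible vectors $z_{(\mathcal{P}_i, \alpha)} \in \mathbb{R}^K$ for every equivalence class $\mathcal{P}_i$ and every
label $\alpha$, achieving SDP value 1. Identifying each coordinate with an $x \in \mathbb{F}_2^k$, we define the
vectors as

\[ z_{(\mathcal{P}_i, \alpha)}(x) := \frac{1}{K} \left( [\mathcal{P}_i] \chi_\alpha \right)(x). \]

It is easy to check that $\| z_{(\mathcal{P}_i, \alpha)} \|^2 = 1/K$ for each of the vectors, which satisfies the first
constraint. Also, $z_{(\mathcal{P}_i, \alpha)}$ and $z_{(\mathcal{P}_i, \beta)}$ are orthogonal for $\alpha \ne \beta$ since

\[ \langle z_{(\mathcal{P}_i, \alpha)}, z_{(\mathcal{P}_i, \beta)} \rangle = \frac{1}{K^2} \langle [\mathcal{P}_i] \chi_\alpha, [\mathcal{P}_i] \chi_\beta \rangle = \frac{1}{K^2} \langle \chi_\alpha, \chi_\beta \rangle = 0 \]

using the fact that $[\mathcal{P}_i]^2 = 1$. The following claim proves that the solution achieves SDP
value 1.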

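Claim 6.4.1. For any edge $e$ indexed by a tuple $(\gamma, f, g)$ with $f(1 + \chi_\gamma) = g(1 + \chi_\gamma)$, we have

\[ \sum_{(\omega_1, \omega_2) \in \pi_{(\gamma,f,g)}} \langle z_{(\mathcal{P}_f, \omega_1)}, z_{(\mathcal{P}_g, \omega_2)} \rangle = 1. \]

Proof. Let $f = [\mathcal{P}_f] \chi_\alpha$ and $g = [\mathcal{P}_g] \chi_\beta$. Then $(\omega_1, \omega_2) \in \pi_e$ iff $(\omega_1 + \alpha) - (\omega_2 + \beta) \in \{0, \gamma\}$.
Therefore, the above quantity equals (dividing by 2 to account for double counting of $\omega$)

\[
\begin{aligned}
& \frac{1}{2} \sum_{\omega} \Bigl( \langle z_{(\mathcal{P}_f, \omega+\alpha)}, z_{(\mathcal{P}_g, \omega+\beta)} \rangle + \langle z_{(\mathcal{P}_f, \omega+\alpha+\gamma)}, z_{(\mathcal{P}_g, \omega+\beta)} \rangle \\
& \qquad\quad + \langle z_{(\mathcal{P}_f, \omega+\alpha)}, z_{(\mathcal{P}_g, \omega+\beta+\gamma)} \rangle + \langle z_{(\mathcal{P}_f, \omega+\alpha+\gamma)}, z_{(\mathcal{P}_g, \omega+\beta+\gamma)} \rangle \Bigr) \\
& = \frac{1}{2} \sum_{\omega} \langle z_{(\mathcal{P}_f, \omega+\alpha)} + z_{(\mathcal{P}_f, \omega+\alpha+\gamma)},\ z_{(\mathcal{P}_g, \omega+\beta)} + z_{(\mathcal{P}_g, \omega+\beta+\gamma)} \rangle. \qquad (6.1)
\end{aligned}
\]

However, for each $\omega$, we have $z_{(\mathcal{P}_f, \omega+\alpha)} + z_{(\mathcal{P}_f, \omega+\alpha+\gamma)} = z_{(\mathcal{P}_g, \omega+\beta)} + z_{(\mathcal{P}_g, \omega+\beta+\gamma)}$, since for all
coordinates $x$,

\[
\begin{aligned}
z_{(\mathcal{P}_f, \omega+\alpha)}(x) + z_{(\mathcal{P}_f, \omega+\alpha+\gamma)}(x)
&= \frac{1}{K} \left( [\mathcal{P}_f] \chi_{\omega+\alpha}(x) + [\mathcal{P}_f] \chi_{\omega+\alpha+\gamma}(x) \right) \\
&= \frac{1}{K} \left( f + f \chi_\gamma \right) \chi_\omega (x) \\
&= \frac{1}{K} \left( g + g \chi_\gamma \right) \chi_\omega (x) \\
&= \frac{1}{K} \left( [\mathcal{P}_g] \chi_{\omega+\beta}(x) + [\mathcal{P}_g] \chi_{\omega+\beta+\gamma}(x) \right) \\
&= z_{(\mathcal{P}_g, \omega+\beta)}(x) + z_{(\mathcal{P}_g, \omega+\beta+\gamma)}(x).
\end{aligned}
\]

This completes the proof, as the value of (6.1) then becomes

\[ \frac{1}{2} \sum_{\omega} \left\| z_{(\mathcal{P}_f, \omega+\alpha)} + z_{(\mathcal{P}_f, \omega+\alpha+\gamma)} \right\|^2 = \frac{1}{2} \sum_{\omega} \left( \left\| z_{(\mathcal{P}_f, \omega+\alpha)} \right\|^2 + \left\| z_{(\mathcal{P}_f, \omega+\alpha+\gamma)} \right\|^2 \right) = 1. \]

□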
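
The claim can be checked numerically on the smallest instance. The sketch below takes $k = 2$, represents Boolean values as $\pm 1$ and labels as integers under XOR (both assumptions of this illustration, as are all function names), and uses exact rational arithmetic so that equalities are tested without rounding.

```python
from fractions import Fraction
from itertools import product

k = 2
K = 2 ** k
functions = list(product((-1, 1), repeat=K))


def chi(alpha):
    """Character chi_alpha as a tuple of +-1 values indexed by x = 0..K-1."""
    return tuple(-1 if bin(alpha & x).count("1") % 2 else 1 for x in range(K))


def mul(f, g):
    return tuple(a * b for a, b in zip(f, g))


def rep_and_shift(f):
    """Return ([P_f], alpha) with f = [P_f] * chi_alpha."""
    return min((mul(f, chi(alpha)), alpha) for alpha in range(K))


def z(rep, a):
    """z_{(P, a)}(x) = (1/K) * ([P] chi_a)(x), with exact rational entries."""
    return tuple(Fraction(v, K) for v in mul(rep, chi(a)))


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def is_feasible(rep):
    """Check the norm and orthogonality constraints for the K vectors of one vertex."""
    vecs = [z(rep, a) for a in range(K)]
    norms_ok = sum(dot(v, v) for v in vecs) == 1
    orth_ok = all(dot(vecs[a], vecs[b]) == 0
                  for a in range(K) for b in range(K) if a != b)
    return norms_ok and orth_ok


def edge_value(gamma, f, g):
    """Sum of <z_(P_f, a), z_(P_g, b)> over the label pairs (a, b) allowed on the edge."""
    rf, alpha = rep_and_shift(f)
    rg, beta = rep_and_shift(g)
    return sum(dot(z(rf, a), z(rg, b))
               for a in range(K) for b in range(K)
               if (a ^ alpha) ^ (b ^ beta) in (0, gamma))
```

Calling `edge_value(gamma, f, g)` on any tuple with $(1 + \chi_\gamma) f = (1 + \chi_\gamma) g$ returns exactly 1, in line with Claim 6.4.1.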

6.4.2 Soundness


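We now prove that any labeling of the instance described above satisfies at most an $O(1/\log K)$
fraction of the constraints. Let $A : V \to \mathbb{F}_2^k$ be a labeling of the vertices. We extend it to a
labeling of all the functions in $\mathcal{F}$ by defining $A([\mathcal{P}_i] \chi_\alpha) := A(\mathcal{P}_i) + \alpha$.

For each $\alpha \in \mathbb{F}_2^k$ define $A_\alpha : \mathcal{F} \to \{0,1\}$ to be the indicator that $A$'s value is $\alpha$. By
definition, the fraction of constraints satisfied by the labeling $A$ is

\[
\mathrm{val}(A) = \mathbb{E}_{(\gamma,f,g) \in \Pi} \Bigl[ \sum_{\alpha \in \mathbb{F}_2^k} A_\alpha(f) \bigl( A_\alpha(g) + A_{\alpha+\gamma}(g) \bigr) \Bigr]
= \mathbb{E}_{(\gamma,f,g) \in \Pi} \Bigl[ \sum_{\alpha \in \mathbb{F}_2^k} A_\alpha(f) \bigl( A_\alpha(g) + A_\alpha(g \chi_\gamma) \bigr) \Bigr] \qquad (6.2)
\]

where the last equality used the fact that for every tuple $(\gamma, f, g) \in \Pi$, we also have $(\gamma, f, g \chi_\gamma) \in \Pi$.

Note that the extended labeling $A : \mathcal{F} \to \mathbb{F}_2^k$ takes on each value in $\mathbb{F}_2^k$ an equal number
of times. Hence

\[ \mathbb{E}_f[A_\alpha(f)] = \Pr_f[A(f) = \alpha] = 1/K \quad \text{for each } \alpha \in \mathbb{F}_2^k. \qquad (6.3) \]

For our preliminary analysis, we will use only this fact to show that for any $\alpha \in \mathbb{F}_2^k$ it holds
that

\[ \mathbb{E}_{(\gamma,f,g) \in \Pi} [A_\alpha(f) A_\alpha(g)] \le O(1/(K \log K)). \qquad (6.4) \]

It will then follow that the soundness (6.2) is at most $O(1/\log K)$. Although this tends to 0,
it does so only at a rate inversely proportional to the logarithm of the alphabet size, which is $K = 2^k$.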


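Beginning with the left-hand side of (6.4), let's write $F = A_\alpha$ for simplicity. We think of
the functions $f$ and $g$ being chosen as follows. We first choose a function $h : \gamma^\perp \to \mathbb{F}_2$. Note
that $\gamma^\perp \subseteq \mathbb{F}_2^k$ is the set of inputs where $\chi_\gamma = 1$ and hence $f = g$, and we let $f(x) = g(x) = h(x)$
for $x \in \gamma^\perp$. The values of $f$ and $g$ on the remaining inputs are chosen independently at
random. Then

\[
\mathbb{E}_{(\gamma,f,g) \in \Pi} [F(f) F(g)] = \mathbb{E}_\gamma\, \mathbb{E}_{h : \gamma^\perp \to \mathbb{F}_2} \bigl[ \mathbb{E}_{f,g \mid h} [F(f) F(g)] \bigr]
= \mathbb{E}_\gamma\, \mathbb{E}_{h : \gamma^\perp \to \mathbb{F}_2} \bigl[ \mathbb{E}_{f \mid h}[F(f)]\, \mathbb{E}_{g \mid h}[F(g)] \bigr]. \qquad (6.5)
\]

Let us write $P_\gamma F(h)$ for $\mathbb{E}_{f \mid h} F(f)$, which is also equal to $\mathbb{E}_{g \mid h} F(g)$. We now use the Fourier
expansion of $F$. Note that the domain here is $\mathbb{F}_2^K$ instead of $\mathbb{F}_2^k$. To avoid confusion with
characters and Fourier coefficients for functions on $\mathbb{F}_2^k$, we will index the Fourier coefficients
below by sets $S \subseteq \mathbb{F}_2^k$. Given an $f \in \mathcal{F}$, we'll write $f^S$ for $\prod_{x \in S} f(x)$ (which is a Fourier
character for the domain $\mathbb{F}_2^K$). Now for fixed $\gamma$ and $h$,

\[
P_\gamma F(h) = \mathbb{E}_{f \mid h}[F(f)] = \mathbb{E}_{f \mid h} \Bigl[ \sum_{S \subseteq \mathbb{F}_2^k} \hat{F}(S) f^S \Bigr] = \sum_{S \subseteq \mathbb{F}_2^k} \hat{F}(S) \cdot \mathbb{E}_{f \mid h}[f^S].
\]

The quantity $\mathbb{E}_{f \mid h}[f^S]$ is equal to $h^S$ if $S \subseteq \gamma^\perp$ and is 0 otherwise. Thus, using the Parseval
identity, we deduce that (6.5) equals

\[
\mathbb{E}_\gamma\, \mathbb{E}_{h : \gamma^\perp \to \mathbb{F}_2} \bigl[ (P_\gamma F(h))^2 \bigr] = \mathbb{E}_\gamma \Bigl[ \sum_{S \subseteq \gamma^\perp} \hat{F}(S)^2 \Bigr] = \sum_{S \subseteq \mathbb{F}_2^k} \Pr_\gamma[S \subseteq \gamma^\perp] \cdot \hat{F}(S)^2.
\]

Recalling that $\gamma \in \mathbb{F}_2^k \setminus \{0\}$ is chosen uniformly, we have that

\[
\sum_{S \subseteq \mathbb{F}_2^k} \Pr_\gamma[S \subseteq \gamma^\perp] \cdot \hat{F}(S)^2 \le \sum_{S \subseteq \mathbb{F}_2^k} 2^{-\dim(S)} \cdot \hat{F}(S)^2,
\]

where we are writing $\dim(S) = \dim(\mathrm{span}\, S)$ for shortness (and defining $\dim(\emptyset) = 0$). For
$|S| \ge 1$ we have $\dim(S) \ge \log_2 |S|$ and hence $2^{-\dim(S)} \le 1/|S|$. Thus

\[
\sum_{S \subseteq \mathbb{F}_2^k} 2^{-\dim(S)} \cdot \hat{F}(S)^2 \le \hat{F}(\emptyset)^2 + \sum_{\emptyset \ne S \subseteq \mathbb{F}_2^k} \hat{F}(S)^2 / |S|.
\]

Corollary 6.3.2 shows that this is at most $O(1/(K \log K))$. This completes the proof, as

\[
\mathrm{val}(A) = 2 \cdot \sum_{\alpha \in \mathbb{F}_2^k} \mathbb{E}_{(\gamma,f,g) \in \Pi} [A_\alpha(f) A_\alpha(g)] \le 2 \cdot \sum_{\alpha \in \mathbb{F}_2^k} \sum_{S \subseteq \mathbb{F}_2^k} 2^{-\dim(S)} \hat{A}_\alpha(S)^2 = O(1/\log K).
\]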
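
The two counting facts used above, $\Pr_\gamma[S \subseteq \gamma^\perp] \le 2^{-\dim(S)}$ and $2^{-\dim(S)} \le 1/|S|$, can be checked exhaustively for small $k$. The sketch below is such a check; the helper names are illustrative assumptions, and vectors of $\mathbb{F}_2^k$ are encoded as integers.

```python
from fractions import Fraction


def rank_gf2(vectors):
    """Dimension of the span over F_2 of a collection of bitmask vectors."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)  # clears the leading bit of b from v when it is set
        if v:
            basis.append(v)
    return len(basis)


def prob_orthogonal(S, k):
    """Pr over uniform nonzero gamma in F_2^k that gamma . s = 0 for every s in S."""
    good = sum(
        all(bin(gamma & s).count("1") % 2 == 0 for s in S)
        for gamma in range(1, 2 ** k)
    )
    return Fraction(good, 2 ** k - 1)


def check_bounds(k):
    """Verify both inequalities for every subset S of F_2^k."""
    for mask in range(2 ** (2 ** k)):
        S = [x for x in range(2 ** k) if (mask >> x) & 1]
        d = rank_gf2(S)
        if prob_orthogonal(S, k) > Fraction(1, 2 ** d):
            return False
        if S and Fraction(1, 2 ** d) > Fraction(1, len(S)):
            return False
    return True
```

The exact probability is $(2^{k - \dim(S)} - 1)/(2^k - 1)$, which is why the first relation is an inequality rather than an equality whenever $\dim(S) \ge 1$.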

6.5 Integrality Gap for 2-to-1 Label-Cover

The instances for 2-to-1 Label-Cover are bipartite. We denote such instances as $(U, V, E, R_1, R_2, \Pi)$,
where $R_1$ and $R_2 = 2R_1$ denote the alphabet sizes on the two sides. For a bipartite instance, the
Label-Cover SDP can be written in the following form involving vectors $y_{(u,i)}$ for each
$u \in U, i \in [R_1]$ and vectors $z_{(v,j)}$ for each $v \in V, j \in [R_2]$.


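\[
\begin{array}{lll}
\text{maximize} & \mathbb{E}_{e=(u,v) \in E} \sum_{j \in [R_2]} \langle y_{(u, \pi_e(j))}, z_{(v,j)} \rangle & \\
\text{subject to} & \sum_{i \in [R_1]} \| y_{(u,i)} \|^2 = 1 & \forall\, u \in U \\
& \sum_{i \in [R_2]} \| z_{(v,i)} \|^2 = 1 & \forall\, v \in V \\
& \langle y_{(u,i)}, y_{(u,j)} \rangle = 0 & \forall\, i \ne j \in [R_1],\ u \in U \\
& \langle z_{(v,i)}, z_{(v,j)} \rangle = 0 & \forall\, i \ne j \in [R_2],\ v \in V
\end{array}
\]

Figure 6.2: SDP for 2-to-1 games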

6.5.1 Gap Instance

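As in the case of 2-to-2 games, the set $V$ consists of equivalence classes $\mathcal{P}_1, \ldots, \mathcal{P}_n$, which
partition the set of functions $\mathcal{F} = \{f : \mathbb{F}_2^k \to \mathbb{F}_2\}$ according to the equivalence relation $\sim$
defined as $f \sim g \Leftrightarrow \exists \alpha \in \mathbb{F}_2^k$ s.t. $f = g \chi_\alpha$. The label set $[R_2]$ is again identified with $\mathbb{F}_2^k$ and
is of size $K = 2^k$.

To describe the set $U$, we further partition the vertices in $V$ according to other equivalence
relations. For each $\gamma \in \mathbb{F}_2^k$, $\gamma \ne 0$, we define an equivalence relation $\equiv_\gamma$ on the set
$\{\mathcal{P}_1, \ldots, \mathcal{P}_n\}$ as

\[ \mathcal{P}_i \equiv_\gamma \mathcal{P}_j \ \Leftrightarrow\ \exists f \in \mathcal{P}_i,\ g \in \mathcal{P}_j \text{ s.t. } f(1 + \chi_\gamma) = g(1 + \chi_\gamma). \]

This is equivalent to saying

\[ \mathcal{P}_i \equiv_\gamma \mathcal{P}_j \ \Leftrightarrow\ \exists f \in \mathcal{P}_i,\ g \in \mathcal{P}_j \text{ s.t. } fg(x) = -1 \Rightarrow \chi_\gamma(x) = -1 \quad \forall x \in \mathbb{F}_2^k. \]

This partitions $\mathcal{P}_1, \ldots, \mathcal{P}_n$ (and hence also the set $\mathcal{F}$) into equivalence classes $\mathcal{Q}^\gamma_1, \ldots, \mathcal{Q}^\gamma_m$.
Here $m = 2^{K/2+1}/K$ (this is immediate from the second definition and the fact that $n = 2^K/K$),
and the partition is different for each $\gamma$. The set $U$ has one vertex for each class of
the form $\mathcal{Q}^\gamma_i$ for all $i \in [m]$ and $\gamma \in \mathbb{F}_2^k \setminus \{0\}$. As before, we denote by $[\mathcal{Q}^\gamma_i]$ the lexicographically
smallest function in the class $\mathcal{Q}^\gamma_i$ and by $\mathcal{Q}^\gamma_f$ the class under $\equiv_\gamma$ containing $f$. Note
that if $f \in \mathcal{Q}^\gamma_i$, then there exists a $\beta \in \mathbb{F}_2^k$ such that $f(1 + \chi_\gamma) = [\mathcal{Q}^\gamma_i] \chi_\beta (1 + \chi_\gamma)$.

The label set $[R_1]$ has size $K/2$. For each vertex $\mathcal{Q}^\gamma_i \in U$, we think of the labels as pairs
of the form $\{\alpha, \alpha + \gamma\}$ for $\alpha \in \mathbb{F}_2^k$. More formally, we identify the label set with the space $\mathbb{F}_2^k / \langle \gamma \rangle$. We
impose one constraint for every pair of the form $(\gamma, f)$ between the vertices $\mathcal{P}_f$ and $\mathcal{Q}^\gamma_f$. If
$f = [\mathcal{P}_f] \chi_\alpha$ and $f(1 + \chi_\gamma) = [\mathcal{Q}^\gamma_f] \chi_\beta (1 + \chi_\gamma)$, then the corresponding relation $\pi_{(\gamma,f)}$ is defined
by requiring that for any labelings $A : V \to [R_2]$ and $B : U \to [R_1]$,

\[ (B(\mathcal{Q}^\gamma_f), A(\mathcal{P}_f)) \in \pi_{(\gamma,f)} \ \Leftrightarrow\ A(\mathcal{P}_f) + \alpha \in B(\mathcal{Q}^\gamma_f) + \beta. \]

Here, if $B(\mathcal{Q}^\gamma_f)$ is a pair of the form $\{\omega, \omega + \gamma\}$, then $B(\mathcal{Q}^\gamma_f) + \beta$ denotes the pair $\{\omega + \beta, \omega + \gamma + \beta\}$.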

6.5.2  SDP  Value 

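As before, we give a set of vectors $y_{(\mathcal{Q}^\gamma_i, \{\alpha, \alpha+\gamma\})}$ and $z_{(\mathcal{P}_i, \alpha)}$ in $\mathbb{R}^K$, identifying each coordinate
with an $x \in \mathbb{F}_2^k$. We define the vectors as

\[ y_{(\mathcal{Q}^\gamma_i, \{\alpha, \alpha+\gamma\})}(x) := \frac{1}{K} \left( [\mathcal{Q}^\gamma_i] \chi_\alpha (1 + \chi_\gamma) \right)(x) \quad \text{and} \quad z_{(\mathcal{P}_i, \alpha)}(x) := \frac{1}{K} \left( [\mathcal{P}_i] \chi_\alpha \right)(x). \]

We have already shown that $\langle z_{(\mathcal{P}_i, \alpha)}, z_{(\mathcal{P}_i, \beta)} \rangle = 0$ for $\alpha \ne \beta$ and $\| z_{(\mathcal{P}_i, \alpha)} \|^2 = 1/K$. It
again follows by the orthogonality of characters that for disjoint pairs $\{\alpha, \alpha + \gamma\}$ and $\{\beta, \beta + \gamma\}$,
the vectors $y_{(\mathcal{Q}^\gamma_i, \{\alpha, \alpha+\gamma\})}$ and $y_{(\mathcal{Q}^\gamma_i, \{\beta, \beta+\gamma\})}$ are orthogonal. It is also easy to verify that
$\| y_{(\mathcal{Q}^\gamma_i, \{\alpha, \alpha+\gamma\})} \|^2 = 2/K$. Hence, the vectors form a feasible solution.

To show that the SDP value is equal to 1, we consider an arbitrary constraint indexed
by the pair $(\gamma, f)$. Let $f = [\mathcal{P}_f] \chi_\alpha$ and $f(1 + \chi_\gamma) = [\mathcal{Q}^\gamma_f] \chi_\beta (1 + \chi_\gamma)$. Then for any $\omega \in \mathbb{F}_2^k$, this
constraint maps the label $\omega + \alpha$ for $\mathcal{P}_f$ to the pair $\{\omega + \beta, \omega + \gamma + \beta\}$ for $\mathcal{Q}^\gamma_f$. Hence, the value
of the SDP solution on this constraint is given by

\[ \sum_{\omega \in \mathbb{F}_2^k} \langle y_{(\mathcal{Q}^\gamma_f, \{\omega+\beta, \omega+\beta+\gamma\})}, z_{(\mathcal{P}_f, \alpha+\omega)} \rangle. \]

We will show that for every $\omega$, $y_{(\mathcal{Q}^\gamma_f, \{\omega+\beta, \omega+\beta+\gamma\})} = z_{(\mathcal{P}_f, \alpha+\omega)} + z_{(\mathcal{P}_f, \alpha+\omega+\gamma)}$. This will complete
the proof, as the above expression then becomes

\[ \sum_{\omega \in \mathbb{F}_2^k} \langle z_{(\mathcal{P}_f, \alpha+\omega)} + z_{(\mathcal{P}_f, \alpha+\omega+\gamma)}, z_{(\mathcal{P}_f, \alpha+\omega)} \rangle = \sum_{\omega \in \mathbb{F}_2^k} \| z_{(\mathcal{P}_f, \alpha+\omega)} \|^2 = 1. \]

To show the vector identity, we simply note that for each coordinate $x$, we have

\[
\begin{aligned}
y_{(\mathcal{Q}^\gamma_f, \{\omega+\beta, \omega+\beta+\gamma\})}(x)
&= \frac{1}{K} \left( [\mathcal{Q}^\gamma_f] \chi_{\omega+\beta} (1 + \chi_\gamma) \right)(x) = \frac{1}{K} \left( f \chi_\omega (1 + \chi_\gamma) \right)(x) \\
&= \frac{1}{K} \left( [\mathcal{P}_f] \chi_{\alpha+\omega} + [\mathcal{P}_f] \chi_{\alpha+\omega+\gamma} \right)(x) \\
&= z_{(\mathcal{P}_f, \alpha+\omega)}(x) + z_{(\mathcal{P}_f, \alpha+\omega+\gamma)}(x).
\end{aligned}
\]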

6.5.3  Soundness 

We now bound the fraction of constraints satisfied by any pair of labelings $A : V \to [K]$ and
$B : U \to [K/2]$. Let $\mathbb{1}\{\mathcal{E}\}$ denote the indicator of the event $\mathcal{E}$, and $N(u)$ denote the neighborhood
of a vertex $u \in U$. Then the fraction of constraints satisfied by any assignments $A, B$
can be bounded by an application of Cauchy-Schwarz as

\[
\begin{aligned}
\mathrm{val}(A, B) &= \mathbb{E}_{u \in U}\, \mathbb{E}_{v \in N(u)} \bigl[ \mathbb{1}\{\pi_{uv}(A(v)) = B(u)\} \bigr] \\
&\le \Bigl( \mathbb{E}_{u \in U} \bigl( \mathbb{E}_{v \in N(u)} \bigl[ \mathbb{1}\{\pi_{uv}(A(v)) = B(u)\} \bigr] \bigr)^2 \Bigr)^{1/2} \\
&= \Bigl( \mathbb{E}_{u \in U}\, \mathbb{E}_{v_1, v_2 \in N(u)} \bigl[ \mathbb{1}\{\pi_{uv_1}(A(v_1)) = B(u) = \pi_{uv_2}(A(v_2))\} \bigr] \Bigr)^{1/2} \\
&\le \Bigl( \mathbb{E}_{u \in U}\, \mathbb{E}_{v_1, v_2 \in N(u)} \bigl[ \mathbb{1}\{\pi_{uv_1}(A(v_1)) = \pi_{uv_2}(A(v_2))\} \bigr] \Bigr)^{1/2}.
\end{aligned}
\]

Note that if $\pi_{uv_1}$ and $\pi_{uv_2}$ are 2-to-1 projections, then the inner quantity in the last expression
denotes the value of a 2-to-2 Label-Cover instance, each of whose constraints is
defined by two 2-to-1 constraints in the original instance. For the 2-to-1 instance described
above, we will show that the inner quantity in fact denotes the fraction of constraints satisfied
by $A$ for the 2-to-2 instance described in Section 6.4. This will show that the fraction
of constraints satisfied by any assignment in the above 2-to-1 instance can be at most
$O(1/\sqrt{\log K})$.

To see this, note that a vertex $u \in U$ and a vertex $v_1 \in V$ can be sampled jointly by
picking a pair $(\gamma, f)$ and taking $u = \mathcal{Q}^\gamma_f$ and $v_1 = \mathcal{P}_f$. Sampling $v_2 \in N(u)$ corresponds to
choosing a class $\mathcal{P}_i$ such that for some $\beta \in \mathbb{F}_2^k$, $[\mathcal{P}_i] \chi_\beta (1 + \chi_\gamma) = f(1 + \chi_\gamma)$. Thus, $v_2$ can be
sampled by choosing a random $g$ such that $f(1 + \chi_\gamma) = g(1 + \chi_\gamma)$ and taking $v_2 = \mathcal{P}_g$.
Also, if $f = [\mathcal{P}_f] \chi_{\alpha_1}$ and $g = [\mathcal{P}_g] \chi_{\alpha_2}$, then the constraint $\pi_{uv_1}(A(v_1)) = \pi_{uv_2}(A(v_2))$ simply
requires that for some $\omega \in \mathbb{F}_2^k$, $A(\mathcal{P}_f) + \alpha_1$ and $A(\mathcal{P}_g) + \alpha_2$ both lie in the set $\{\omega, \omega + \gamma\}$,
and hence

\[ (A(\mathcal{P}_f) + \alpha_1) - (A(\mathcal{P}_g) + \alpha_2) \in \{0, \gamma\}. \]

6.6 From 2-to-1 Constraints to α-constraints

In this section we show that any integrality gap instance for 2-to-1 games, with sufficiently
many edges, can be converted to an integrality gap instance for games with α-constraints.

The SDP we consider for these games is identical to the ones considered before, except for
the objective function.

Theorem 6.6.1. Let $\mathcal{L} = (U, V, E, R, 2R, \Pi)$ be a bipartite instance of the 2-to-1 Label-Cover
problem with $\mathrm{OPT}(\mathcal{L}) \le \delta$ and SDP value 1. Also, let $|E| \ge 4(|U| + |V|) \log(R) / \epsilon^2$. Then there
exists another instance $\mathcal{L}' = (U, V, E, 2R, \Pi')$ of Label-Cover with α-constraints having
SDP value 1 and $\mathrm{OPT}(\mathcal{L}') \le \delta + \epsilon + 1/R$.

Proof. The proof simply follows by adding $R$ "fake" labels for each vertex $u \in U$, and then
randomly augmenting the constraints to make them of the required form. In particular,
let the new labels we add for each $u \in U$ be $R + 1, \ldots, 2R$. Let $e = (u,v)$ be an edge. Since
the constraints in $\Pi$ are of 2-to-1 type, there exist permutations $\sigma_{1,e} : [R] \to [R]$ and
$\sigma_{2,e} : [2R] \to [2R]$ such that after permuting the labels on each side, the projection $\pi_e$ maps
labels $(2i - 1, 2i)$ to $i$, i.e. $\pi_e(\sigma_{2,e}^{-1}(2i - 1)) = \pi_e(\sigma_{2,e}^{-1}(2i)) = \sigma_{1,e}^{-1}(i)$.

To incorporate the new labels into the constraint, choose a random bijection
$\sigma'_{1,e} : \{R + 1, \ldots, 2R\} \to [R]$. We now construct a new permutation $\tilde{\sigma}_{1,e} : [2R] \to [2R]$ as
$\tilde{\sigma}_{1,e}(i) = 2\sigma_{1,e}(i) - 1$ if $i \le R$ and $\tilde{\sigma}_{1,e}(i) = 2\sigma'_{1,e}(i)$ if $i > R$, i.e. the new labels are mapped to the even
positions $2, 4, \ldots, 2R$ while the others are mapped to the odd positions.

The original 2-to-1 constraints are satisfied by a labeling $A$ iff the pair $(\tilde{\sigma}_{1,e}(A(u)), \sigma_{2,e}(A(v)))$
is of the form $(2i - 1, 2i - 1)$ or $(2i - 1, 2i)$ for some $i \le R$. We augment the constraint by also
allowing $(\tilde{\sigma}_{1,e}(A(u)), \sigma_{2,e}(A(v)))$ to be $(2i, 2i - 1)$ for some $i$. Note that if the constraint is satisfied
in this way, then $u$ must get one of the new labels. Also, note that the augmentation
is random, as we choose the map $\sigma'_{1,e}$ independently at random for each edge $e$.

Given a vector solution $\{y_{(u,i)}\}_{u \in U, i \in [R]}$ and $\{z_{(v,j)}\}_{v \in V, j \in [2R]}$ for $\Pi$, we leave the vectors
$z_{(v,j)}$ unchanged and for each $u \in U$, take $\tilde{y}_{(u,i)} = y_{(u,i)}$ if $i \le R$ and 0 otherwise. It is
immediate that the solution is feasible. Also, the value of the objective is the same as the
value of the 2-to-1 SDP, as all the additional terms in the objective involve some vector
$\tilde{y}_{(u,i)}$ for some $i > R$ and are hence 0. Thus, the SDP value for the new instance is 1.

To bound the value of any labeling $A : U \cup V \to [2R]$, we split it as

\[
\begin{aligned}
\mathbb{E}_{e=(u,v) \in E} \bigl[ \mathbb{1}\{(A(u), A(v)) \text{ satisfy } e\} \bigr]
&= \mathbb{E}_{e=(u,v) \in E} \bigl[ \mathbb{1}\{A(u) \le R\} \cdot \mathbb{1}\{(A(u), A(v)) \text{ satisfy } e\} \bigr] \\
&\quad + \mathbb{E}_{e=(u,v) \in E} \bigl[ \mathbb{1}\{A(u) > R\} \cdot \mathbb{1}\{(A(u), A(v)) \text{ satisfy } e\} \bigr].
\end{aligned}
\]

Note that the first term is simply the fraction of 2-to-1 constraints satisfied by $A$, and it is at
most $\delta$ by assumption.

Also, for any fixed labeling $A$, the probability over the choice of the random maps
$\{\sigma'_{1,e}\}_{e \in E}$ that $(A(u), A(v))$ satisfy $e$, given that $A(u) > R$, is at most $1/R$. By a Chernoff
bound, the fraction of edges $(u,v)$ satisfied with $A(u) > R$ is at most $1/R + \epsilon$ with probability
at least $1 - \exp(-\epsilon^2 |E| / 2)$ over the choice of the random maps. By a union bound and the condition on
$|E|$, the second term is at most $1/R + \epsilon$ for all labelings $A$, with high probability over the
choice of $\{\sigma'_{1,e}\}_{e \in E}$. Picking an instance with the appropriate choice of the maps $\sigma'_{1,e}$ gives
the required instance $\mathcal{L}'$. □
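
The augmentation step in this proof can be sketched for a single edge without going through the canonical permutations. In the sketch below, `augment_to_alpha` is a hypothetical helper; choosing the smaller of the two preimages as the partner of a fake label is an assumption that stands in for "the preimage placed at the odd position by $\sigma_{2,e}$".

```python
import random


def augment_to_alpha(proj, R, rng):
    """Turn a 2-to-1 projection on one edge into an alpha-constraint.

    proj: dict mapping each v-label j in 1..2R to a u-label in 1..R (each u-label hit twice).
    rng:  a random.Random instance supplying the random bijection sigma'.
    Returns the set of allowed (u_label, v_label) pairs, with u-labels now in 1..2R.
    """
    allowed = {(proj[j], j) for j in proj}  # the original 2-to-1 pairs
    targets = list(range(1, R + 1))
    rng.shuffle(targets)  # random bijection from fake labels R+1..2R to real labels 1..R
    for fake, real in zip(range(R + 1, 2 * R + 1), targets):
        # The fake label is accepted together with exactly one preimage of its real partner.
        partner = min(j for j in proj if proj[j] == real)
        allowed.add((fake, partner))
    return allowed


proj = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}
constraint = augment_to_alpha(proj, 3, random.Random(0))
```

Each block of the result has the three-pair shape of $T'$: a real label with its two preimages, plus a fake label with one of them. For a fixed fake label and a fixed label on the $V$ side, the pair is accepted with probability at most $1/R$ over the random bijection, which is the quantity the Chernoff argument uses.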

6.7  Discussion 

The instances we construct have SDP value 1 only for the most basic SDP relaxation. It
would be desirable to get gaps for stronger SDPs, beginning with the most modest extensions
of this basic SDP. For example, in the SDP for 2-to-1 Label-Cover from Figure 6.2,
we can add valid nonnegativity constraints for the dot product between every pair of vectors
in the set

\[ \{ y_{(u,i)} \mid u \in U,\ i \in [R_1] \} \cup \{ z_{(v,j)} \mid v \in V,\ j \in [R_2] \}, \]

since in the integral solution all these vectors are $\{0,1\}$-valued. The vectors we construct
do not obey such a nonnegativity requirement. For the case of Unique-Games, Khot and
Vishnoi [107] were able to ensure nonnegativity of all dot products by simply taking tensor
products of the vectors with themselves and defining new vectors $y'_{(u,i)} = y_{(u,i)}^{\otimes 2} = y_{(u,i)} \otimes y_{(u,i)}$
and $z'_{(v,j)} = z_{(v,j)}^{\otimes 2} = z_{(v,j)} \otimes z_{(v,j)}$. Since $\langle a^{\otimes 2}, b^{\otimes 2} \rangle = \langle a, b \rangle^2$, the desired nonnegativity of dot
products is ensured.

We cannot apply this tensoring idea in our construction as it does not preserve the
SDP value at 1. For example, for 2-to-1 Label-Cover, if we have $y_{(u,i)} = z_{(v,j_1)} + z_{(v,j_2)}$
(so that these vectors contribute 1 to the objective value of the SDP of Figure 6.2), then
upon tensoring we no longer necessarily have $y_{(u,i)}^{\otimes 2} = z_{(v,j_1)}^{\otimes 2} + z_{(v,j_2)}^{\otimes 2}$. Extending our gap
instances to obey the nonnegative dot product constraints is therefore a natural question
that we leave open. While this seems already quite challenging, one can of course be
more ambitious and ask for gap instances for stronger SDPs that correspond to a certain
number of rounds of some hierarchy, such as the Sherali-Adams hierarchy together with
consistency of vector dot products with pairwise marginals. For Unique-Games, gap
instances for several rounds of such a hierarchy were constructed in [127].
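
Both halves of the tensoring discussion can be seen on toy vectors. The sketch below uses small hand-picked vectors (an assumption for illustration, not the vectors of the gap instance) to show that squaring via tensoring makes inner products nonnegative but destroys the additive relation $y = z_1 + z_2$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tensor_square(a):
    """Flattened tensor product of a with itself."""
    return [x * y for x in a for y in a]


# Nonnegativity: <a (x) a, b (x) b> equals <a, b>^2 even when <a, b> is negative.
a, b = [1, -2, 0], [1, 1, 4]
plain = dot(a, b)
tensored = dot(tensor_square(a), tensor_square(b))

# Additivity is lost: y = z1 + z2 does not imply y (x) y = z1 (x) z1 + z2 (x) z2.
z1, z2 = [1, 0], [0, 1]
y = [p + q for p, q in zip(z1, z2)]
y_sq = tensor_square(y)
z_sq_sum = [p + q for p, q in zip(tensor_square(z1), tensor_square(z2))]
```

Here `plain` is negative while `tensored` is its square, and `y_sq` contains the cross terms $z_1 \otimes z_2 + z_2 \otimes z_1$ that are absent from `z_sq_sum`.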

Chapter 7

Unique Games over Integers

7.1 Introduction


In this chapter, assuming the UGC, we prove that it is NP-hard to $(1 - \epsilon, \epsilon)$-approximate
Max 2-Lin$_{\mathbb{Z}}$ (and Max 2-Lin$_{\mathbb{R}}$).

7.1.1  Motivation 

As we have discussed, Khot et al. [99] showed that the UGC is equivalent to the following
statement: for any $\epsilon > 0$, Max 2-Lin$_q(1 - \epsilon, \epsilon)$ is NP-hard for large enough $q$.

An obvious question left open is whether the UGC also implies hardness of solving
two-variable linear equations over the integers, rather than over the integers modulo a
large constant.

Question 7.1.1. Is it true that for all constant $\epsilon > 0$, the Max 2-Lin$_{\mathbb{Z}}(1 - \epsilon, \epsilon)$ problem is
NP-hard assuming the UGC?

We believe that the lack of an additional quantifier over $q$ here gives this question a certain
aesthetic appeal.

7.1.2  Related  Work 

The version of Question 7.1.1 for Max 3-Lin (i.e., equations of the form $v_i - v_j + v_k = c_{ijk}$)
took a relatively long time to be resolved. Håstad proved his celebrated NP-hardness result
for Max 3-Lin$_q(1 - \epsilon, 1/q + \epsilon)$ in 1997 [74]; however, it was not until a decade later that
Guruswami and Raghavendra [69] showed that indeed Max 3-Lin$_{\mathbb{Z}}(1 - \epsilon, \epsilon)$ is NP-hard for
all constant $\epsilon > 0$. A relatively simple observation allowed Guruswami and Raghavendra
to also deduce that Max 3-Lin$_{\mathbb{R}}(1 - \epsilon, \epsilon)$ is NP-hard; here the equations are still of the form
$v_i - v_j + v_k = c_{ijk}$ for $c_{ijk} \in \mathbb{Z}$, but the variables can be assigned values in $\mathbb{R}$.

A version of the Max 3-Lin$_{\mathbb{R}}$ problem is also being studied by Khot and Moshkovitz
in ongoing work. In their formulation, called Robust-Max 3-Lin$_{\mathbb{R}}$, the constants $c_{ijk}$
are all 0; however, certain conditions are placed on how the variables $v_i$ may be assigned
real values, so as to eliminate the trivial solution $v_i \equiv 0$. Assuming the UGC, Khot and
Moshkovitz [101] show that given a system with a $(1 - \epsilon)$-good solution, roughly speaking it
is NP-hard to find a solution in which a constant fraction of the equations are satisfied to
within a small error. Very recently they have eliminated the need for the UGC. The motivation
for their work is the hope of establishing the same sort of result for Robust-Max 2-Lin$_{\mathbb{R}}$,
a problem closely connected with Unique-Games.

7.1.3  Statement  of  Our  Results 

In this work we show a positive answer to Question 7.1.1. In fact, our main theorem is the
following stronger result:

Theorem 7.1.2. Assume the UGC. For any small constant $\epsilon > 0$, there exists a constant
$q = q(\epsilon) \in \mathbb{N}$ such that the following holds: Given an instance $\mathcal{I}$ of linear equations over
variables $\{x_k\}$ of the form $x_i - x_j = c_{ij}$, in which the integer constants $c_{ij}$ are in the
range $[-q, q]$, it is NP-hard to distinguish the following two cases:

• There is a $(1-\epsilon)$-good integer assignment to the variables.
• When the equations are evaluated modulo any integer $m \ge q$, there is no assignment to
the variables which is $\epsilon$-good (i.e., satisfies more than an $\epsilon$ fraction of the equations).

Assuming $\epsilon < 0.1$, it suffices for $q(\epsilon)$ to be large enough that $O(1/q)^{\epsilon/(2-\epsilon)} < \epsilon$.
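
To get a feel for how large $q$ has to be under this condition, the requirement can be evaluated numerically. The sketch below assumes, purely for illustration, that the constant hidden in the $O(\cdot)$ is 1; the function name `log10_q_threshold` is hypothetical.

```python
import math

def log10_q_threshold(eps: float) -> float:
    """log10 of the smallest real q with (1/q)**(eps/(2-eps)) <= eps,
    taking the hidden O(.) constant to be 1 (illustrative assumption)."""
    # (1/q)^(eps/(2-eps)) <= eps  <=>  q >= (1/eps)^((2-eps)/eps)
    return ((2 - eps) / eps) * math.log10(1 / eps)

for eps in (0.09, 0.05, 0.01):
    print(f"eps={eps}: q must exceed roughly 10^{log10_q_threshold(eps):.1f}")
```

Under this assumption the required $q$ grows very quickly as $\epsilon$ shrinks, which is consistent with $q$ being a constant that depends only on $\epsilon$ and not on the instance size.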

An interesting and somewhat novel aspect of this result is that it gives hardness even
for a “multi-objective” problem. In the search version of Theorem 7.1.2's algorithmic task,
although the algorithm is promised there is an extremely good integer solution to the given
equations, it may attempt to find a slightly good solution modulo any $m \ge q$ of
its choosing. We show that even still, the task is hard assuming the UGC.

From our main result, Theorem 7.1.2, we immediately deduce the following corollaries:

Corollary 7.1.3. Assuming the UGC, for all $\epsilon > 0$ the Max 2-Lin$_{\mathbb{Z}}(1-\epsilon,\epsilon)$ problem is
NP-hard.

Proof. If there is an $\epsilon$-good integer assignment to the variables, then this assignment is also
$\epsilon$-good modulo $q$ (or any other integer $m \ge q$). □

Corollary 7.1.4. Assuming the UGC, for all $\epsilon > 0$ there exists $q$ such that the Max 2-Lin$_m(1-\epsilon,\epsilon)$
problem is NP-hard for any $m \ge q$, even for $m = m(n)$ which is super-constant. In particular,
the algorithmic task in Theorem 7.1.2 is equivalent to the UGC.

Proof. If there is a $(1-\epsilon)$-good integer assignment to the variables, it is also $(1-\epsilon)$-good
modulo $m$. □

Corollary 7.1.5. Assuming the UGC, for all $\epsilon > 0$ the Max 2-Lin$_{\mathbb{R}}(1-\epsilon,\epsilon)$ problem is
NP-hard.

Proof. Certainly any $(1-\epsilon)$-good integer assignment to the variables is also a $(1-\epsilon)$-good
real assignment. Further, as each constraint in Theorem 7.1.2 is of the form $v_i - v_j = c_{ij} \in \mathbb{Z}$,
any $\epsilon$-good real assignment to the variables $v_i$ can be converted into an $\epsilon$-good integer
assignment simply by dropping all the fractional parts. □
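
The rounding step in this proof can be illustrated on a toy instance. The sketch below uses exact rationals as stand-ins for real values; the instance, the assignment, and the helper name `satisfied_fraction` are hypothetical. If $v_i - v_j$ is an integer, then $v_i$ and $v_j$ have the same fractional part, so taking floors preserves the difference.

```python
from fractions import Fraction
from math import floor

def satisfied_fraction(assignment, equations):
    """Fraction of equations v_i - v_j = c_ij that hold exactly."""
    good = sum(1 for (i, j, c) in equations if assignment[i] - assignment[j] == c)
    return Fraction(good, len(equations))

# Hypothetical toy instance: (i, j, c) encodes v_i - v_j = c with integer c.
equations = [(0, 1, 2), (1, 2, -3), (0, 2, 5), (2, 3, 1)]
real_assignment = [Fraction(7, 2), Fraction(3, 2), Fraction(9, 2), Fraction(10, 3)]

# Drop the fractional parts.
integer_assignment = [floor(v) for v in real_assignment]

print(satisfied_fraction(real_assignment, equations))
print(satisfied_fraction(integer_assignment, equations))
```

Every equation satisfied by the real assignment stays satisfied after rounding; rounding may additionally satisfy equations that were previously violated, which only helps.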


7.2  Overview  of  Our  Proof 

We now describe the new ideas we introduce to prove Theorem 7.1.2. In this section, we
assume the reader is closely familiar with the proof of the Khot-Kindler-Mossel-O’Donnell
(KKMO) UGC-hardness result for Max 2-Lin$_q(1-\epsilon,\epsilon)$. Our discussion will also not be
completely formal.

As KKMO showed, given $\epsilon > 0$ it is sufficient to construct a Dictator Test for functions
$f : \mathbb{Z}_q^L \to \mathbb{Z}_q$ using 2Lin-constraints, with the following two properties: (i) dictator functions
$f(x) = x_i$ pass the test with probability at least $1-\epsilon$; (ii) any $f : \mathbb{Z}_q^L \to \Delta_q$ with all influences
smaller than $\tau$ passes the test with probability at most $(1/q)^{\epsilon/(2-\epsilon)} + \kappa$, where the “error term”
$\kappa = \kappa(q,\epsilon,\tau)$ can be made arbitrarily small by taking $\tau > 0$ to be a sufficiently small
constant independent of $L$. Here $\Delta_q$ is the convexification of $\mathbb{Z}_q$; i.e., the set of all probability
distributions over $\mathbb{Z}_q$.

As a first step one might try extending the KKMO analysis to Max 2-Lin$_m$, where $m$
is “super-constant”. The essential difficulty is that applying the key tool, the Majority Is
Stablest Theorem, to $\tau$-small-influence functions $f : [m]^L \to [0,1]$ introduces an error term
$\kappa(m,\epsilon,\tau)$ which depends on $m$. If $m$ is super-constant, even as a function of $L$, this will
cause the KKMO reduction from Unique-Games$_L$ to fail; in particular, it means that
in the soundness case, one would decode such $f$'s to $\omega(1)$ many labels in $[L]$, which is
unacceptable.

Since we presumably must use the Majority Is Stablest Theorem, and since we also
care about constraints modulo a super-constant $m$, we are led to consider Dictator Tests for
functions $f : [q]^L \to \mathbb{Z}_m$. We are not aware of any prior work on testing such functions, with
differing domain and range (arguably, the work on hardness of ordering constraints [67]
has some of the same flavor). An initial difficulty in working with such functions is that
the usual method of “folding” no longer makes sense. Our first observation is that one need
not fold by the usual method of restricting the domain by a factor of $q$; instead, one can
build folding directly into the KKMO test. I.e., KKMO’s result could be obtained via the
following Dictator Test for functions $f : \mathbb{Z}_q^L \to \mathbb{Z}_q$: Choose $x, x' \sim \mathbb{Z}_q^L$ to be $(1-\epsilon)$-correlated
random strings, choose also $c, c' \in \mathbb{Z}_q$ uniformly and independently, and then test the 2Lin
constraint

$$f(x + (c,c,\dots,c)) - c = f(x' + (c',c',\dots,c')) - c'. \qquad (7.1)$$

To analyze the soundness of this test, one introduces the “randomized function”
$g : \mathbb{Z}_q^L \to \Delta_q$ defined by $g(x) = f(x + (c,\dots,c)) - c$ for uniformly random $c$, in which case the
probability that $f$ passes the test is $\mathbb{S}_{1-\epsilon}[g]$. One then observes that $\mathbf{E}[g_a(x)] = 1/q$ for each
coordinate output function $g_a : \mathbb{Z}_q^L \to [0,1]$, $a \in \mathbb{Z}_q$. Thus one can apply the Majority Is
Stablest Theorem to bound $\mathbb{S}_{1-\epsilon}[g]$ by

$$q\,\big(\Gamma_{1-\epsilon}(1/q) + \kappa(q,\epsilon,\tau)\big) \le (1/q)^{\epsilon/(2-\epsilon)} + o(1),$$

as necessary.

We will show how to extend this analysis to functions $f : [q]^L \to \mathbb{Z}_m$, where $m \ge q$.
Proceeding with the same “built-in folding”, we obtain the function $g : [q]^L \to \Delta_m$ which has
the property that $\mathbf{E}[g_a(x)] \le 1/q$ for each $a \in [m]$. Our main technical result, Lemma 7.4.3,
shows that this is sufficient to prove

$$\mathbb{S}_{1-\epsilon}[g] = \sum_{a \in [m]} \mathbb{S}_{1-\epsilon}[g_a] \le (1/q)^{\epsilon/(2-\epsilon)} + q^{\log q}\,\kappa(q,\epsilon,\tau) = (1/q)^{\epsilon/(2-\epsilon)} + o(1).$$

The key point here is that the error term does not depend at all on $m$, and hence the overall
analysis works even for $m$ super-constant. To evade dependence on $m$, the idea is that one
can obtain the bound $\mathbb{S}_{1-\epsilon}[g_a] \le \mathbf{E}[g_a](1/q)^{\epsilon/2}$ without any small-influences assumption at
all if $\mathbf{E}[g_a] < q^{-\log q}$; one only needs to use hypercontractivity.

These ideas let us obtain UG-hardness of Max 2-Lin$_m(1-\epsilon,\epsilon)$ even for super-constant
$m$. To complete the proof of our main Theorem 7.1.2, we need to improve the completeness
aspect of the Dictator Test so that even integer-valued dictators $f : [q]^L \to \mathbb{Z}$ pass with
probability close to 1. An observation here is that an integer-valued dictator $f(x) = x_i$
already passes our test with probability close to 1/2: Ignoring the $\epsilon$-noise, the test (7.1)
fails only if one of $x_i + c$, $x_i + c'$ “wraps around” modulo $q$ but the other doesn’t.

There is a very simple idea for decreasing the probability of such wraparound: choose
$c$ and $c'$ from a range smaller than $[q]$. E.g., if we choose $c, c' \sim [q/t]$, then we get
wraparound in $x_i + c$ with probability at most $1/t$. Hence integer-valued dictators $f : [q]^L \to \mathbb{Z}$,
$f(x) = x_i$ will pass the test in (7.1) with probability at least $1 - \epsilon - 2/t$. How does this
restricted folding affect the soundness analysis? It means that the associated randomized
function $g : [q]^L \to \Delta_m$ will only satisfy $\mathbf{E}[g_a] \le t/q$ for each $a \in [m]$, rather than $\mathbf{E}[g_a] \le 1/q$.
But this is still sufficient for our technical Lemma 7.4.3 to bound $\mathbb{S}_{1-\epsilon}[g]$ by roughly
$(t/q)^{\epsilon/(2-\epsilon)}$. Thus by taking $t = \log(q)$, say, we get a 2Lin-based Dictator Test having
integer-valued completeness $1 - \epsilon - O(1/\log(q))$ and $\mathbb{Z}_m$-valued soundness $O(1/q)^{\epsilon/(2-\epsilon)}$ for any $m \ge q$.
This suffices to establish our main Theorem 7.1.2.

7.2.1  Comparison  with  Guruswami-Raghavendra 

Here we briefly compare our methods with those Guruswami and Raghavendra [69] used
to establish hardness for Max 3-Lin$_{\mathbb{Z}}$. Although they also mentioned Max 3-Lin$_m$ for
very large $m$ in the overview of their work, their methods are somewhat more integer-specific
than ours. In particular, they worked with Dictator Tests on functions $f : \mathbb{Z}_+^L \to \mathbb{Z}$,
using a certain exponential distribution on the domain $\mathbb{Z}_+$. (Ultimately, of course, they
truncated the distribution to a finite range.) This necessitated introducing and analyzing
a somewhat technical method of decoding functions $f$ to coordinates associated to sparse
Fourier frequencies $\omega \in [0, 2\pi]^L$ with large Fourier coefficients.

Guruswami and Raghavendra also described their Dictator Tests as “derandomized
versions” of Håstad’s tests, where the amount of randomness of the test depends only on
the soundness. The same could be said of our result vis-a-vis KKMO’s Dictator Tests: we
get Max 2-Lin$_m$ Dictator Tests in which the size of the domain elements, $q$, depends only
on the desired soundness of the test.


7.3  Definitions  and  analytic  tools 

7.3.1  Notation 

For $r \in \mathbb{R}^+$ we let $[r]$ denote $\{1, 2, \dots, \lfloor r \rfloor\}$. Given $m \in \mathbb{N}$ we write $\oplus_m$ for addition modulo $m$.
It will also be convenient to use the following slightly unusual notation:

Definition 7.3.1. We write $\mathbb{Z}_m$ for the group of integers modulo $m$. We will also sometimes
identify this set with $[m] \subset \mathbb{Z}$, not with the more standard $\{0, 1, \dots, m-1\}$. Finally, we extend
the notation to $m = \infty$, in which case we understand $\mathbb{Z}_m$ to mean simply the integers, $\mathbb{Z}$.

Definition 7.3.2. We write $\Delta_m$ for the set of probability distributions over $\mathbb{Z}_m$ with finite
support; when $m \ne \infty$ we can identify $\Delta_m$ with the standard $(m-1)$-dimensional simplex in
$\mathbb{R}^m$. We also identify an element $a \in \mathbb{Z}_m$ with a distribution in $\Delta_m$, namely, the distribution
that puts all of its probability mass on $a$.

7.3.2 Noise stability and influences on $[q]^n \to \mathbb{R}^m$

Here we need to use a generalization of the Harmonic Analysis introduced in Section 3.1.1.
We will be considering functions of the form $f : [q]^n \to \mathbb{R}^m$ (as opposed to $\mathbb{R}$), where $q, n, m \in \mathbb{N}$.
We will also allow $m = \infty$, in which case we interpret the range as all sequences in
$\mathbb{R}^{\mathbb{Z}}$ with at most finitely many nonzero coordinates. The set of all functions $f : [q]^n \to \mathbb{R}^m$
forms an inner product space with inner product

$$\langle f, g \rangle = \mathbf{E}_{x \sim [q]^n}\big[\langle f(x), g(x) \rangle\big];$$

here we mean that $x$ is uniformly random and the $\langle \cdot, \cdot \rangle$ inside the expectation is the usual
inner product in $\mathbb{R}^m$. We also write $\|f\| = \sqrt{\langle f, f \rangle}$ as usual.

For $0 \le \rho \le 1$, we define $T_\rho$ to be the linear operator on this inner product space given
by

$$T_\rho f(x) = \mathbf{E}_{y}[f(y)],$$

where $y$ is a random string in $[q]^n$ which is $\rho$-correlated to $x$. We define the noise stability
of $f$ at $\rho$ to be

$$\mathbb{S}_\rho[f] = \langle f, T_\rho f \rangle.$$

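For $i \in [n]$, we define the influence of $i$ on $f : [q]^n \to \mathbb{R}^m$ to be

$$\mathrm{Inf}_i[f] = \mathbf{E}\big[\mathrm{Var}_{x_i}[f(x)]\big],$$

where $\mathrm{Var}[f]$ is defined to be $\mathbf{E}[\|f\|^2] - \|\mathbf{E}[f]\|^2$. More generally, for $0 < \eta < 1$ we define the
$\eta$-noisy-influence of $i$ on $f$ to be

$$\mathrm{Inf}_i^{(1-\eta)}[f] = \mathrm{Inf}_i[T_{1-\eta} f].$$

One may observe that

$$\mathrm{Inf}_i^{(1-\eta)}[f] = \sum_{j=1}^{m} \mathrm{Inf}_i^{(1-\eta)}[f_j],$$

where $f_j : [q]^n \to \mathbb{R}$ denotes the $j$th-coordinate output function of $f$. (When $m = \infty$ the sum
should be over $j \in \mathbb{Z}$.)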
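
The definitions above can be checked by brute force on a tiny domain. The following sketch computes $T_\rho f$ and noisy influences exactly with rational arithmetic. It assumes the usual meaning of "$\rho$-correlated" (each coordinate of $y$ equals that of $x$ with probability $\rho$ and is otherwise resampled uniformly); all function names are hypothetical and coordinates are 0-indexed in the code.

```python
from fractions import Fraction
from itertools import product

def points(q, n):
    return list(product(range(1, q + 1), repeat=n))

def noise_op(f, q, n, rho):
    """T_rho f by brute force; f maps each x in [q]^n to a list of Fractions."""
    pts = points(q, n)
    m = len(f[pts[0]])
    out = {}
    for x in pts:
        acc = [Fraction(0)] * m
        for y in pts:
            w = Fraction(1)  # Pr[the rho-correlated copy of x equals y]
            for xi, yi in zip(x, y):
                w *= (rho if xi == yi else 0) + (1 - rho) * Fraction(1, q)
            for j in range(m):
                acc[j] += w * f[y][j]
        out[x] = acc
    return out

def influence(f, q, n, i):
    """Inf_i[f]: average over the other coordinates of Var over x_i of f(x)."""
    total = Fraction(0)
    for rest in product(range(1, q + 1), repeat=n - 1):
        vals = [f[rest[:i] + (a,) + rest[i:]] for a in range(1, q + 1)]
        m = len(vals[0])
        mean = [Fraction(sum(v[j] for v in vals), q) for j in range(m)]
        second_moment = Fraction(sum(c * c for v in vals for c in v), q)
        total += second_moment - sum(c * c for c in mean)
    return total / q ** (n - 1)

def noisy_influence(f, q, n, i, eta):
    return influence(noise_op(f, q, n, 1 - eta), q, n, i)

q, n, eta = 3, 2, Fraction(1, 4)
# Dictator on the first coordinate, viewed as a point mass in Delta_3.
f = {x: [Fraction(int(x[0] == a)) for a in range(1, q + 1)] for x in points(q, n)}
# Its coordinate output functions f_j : [q]^n -> R.
coords = [{x: [v[j]] for x, v in f.items()} for j in range(q)]

inf1 = noisy_influence(f, q, n, 0, eta)
inf2 = noisy_influence(f, q, n, 1, eta)
print(inf1, inf2, sum(noisy_influence(fj, q, n, 0, eta) for fj in coords))
```

For this dictator the noisy influence of the first coordinate equals $\rho^2(1 - 1/q)$ with $\rho = 1 - \eta$, the second coordinate has influence 0, and the sum over coordinate output functions reproduces the vector-valued influence.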

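We will need the following “convexity of noisy-influences” fact:

Proposition 7.3.3. Let $f^{(1)}, \dots, f^{(t)}$ be a collection of functions $[q]^n \to \mathbb{R}^m$. Then

$$\mathrm{Inf}_i^{(1-\eta)}\Big[\mathrm{avg}_{k \in [t]}\big\{f^{(k)}\big\}\Big] \le \mathrm{avg}_{k \in [t]}\Big\{\mathrm{Inf}_i^{(1-\eta)}\big[f^{(k)}\big]\Big\}.$$

Here for any $c_1, c_2, \dots, c_t \in \mathbb{R}$ (or $\mathbb{R}^m$), we use the notation $\mathrm{avg}(c_1, \dots, c_t)$ to denote their
average:

$$\frac{\sum_{i=1}^{t} c_i}{t}.$$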

The following fact is well known:

Fact 7.3.4. For any $\eta$,

$$\sum_{i=1}^{n} \mathrm{Inf}_i^{(1-\eta)}[f] \le \frac{\mathrm{Var}[f]}{2e\eta}.$$



7.3.3  Hypercontractivity  and  Majority  Is  Stablest 

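Recall the hypercontractivity theorem on $[q]^n$.

Theorem 7.3.5. Let $q > 2$, $f : [q]^n \to \mathbb{R}$, and $0 < \epsilon < 1$. Then

$$\|T_{\sqrt{1-\epsilon}} f\|_2 \le \|f\|_p, \quad \text{where } p = p(q,\epsilon) = 1 + (1-\epsilon)^{(2-4/q)/\log(q-1)}.$$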


The second tool we need is the Majority Is Stablest Theorem.

Theorem 7.3.6. Suppose $f : [q]^n \to [0,1]$ has $\mathrm{Inf}_i^{(1-\eta)}[f] \le \tau \le (\log q)^{-(\log q)/c}$ for all $i \in [n]$,
where $\eta \le c(\log q)/\log(1/\tau)$ and $c > 0$ is a certain universal constant. Let $\mu = \mathbf{E}[f]$. Then for
any $0 < \epsilon < 1$,

$$\mathbb{S}_{1-\epsilon}[f] \le \Gamma_{1-\epsilon}(\mu) + \frac{\log q}{c\epsilon} \cdot \frac{\log\log(1/\tau)}{\log(1/\tau)}.$$

This is essentially a special case of Theorem 3.2.4, with the error bound explicitly given.

Proposition 7.3.7. Assume $0 < \epsilon < .1$ and $0 < \mu < \exp(-1/\sqrt{\epsilon})/\sqrt{\epsilon}$. Then $\Gamma_{1-\epsilon}(\mu) \le \mu^{1+\epsilon/(2-\epsilon)}$.

This estimate follows from Corollary 10.2 in [99]. (The expression in that corollary is
in fact an upper bound on $\Gamma_{1-\epsilon}(\mu)$ for all $0 < \epsilon < 1$ and $0 < \mu < 1/2$, as can be verified using
the inequality in Proposition 6.1 of [99]. The simplified bound $\mu^{1+\epsilon/(2-\epsilon)}$ holds when $\epsilon < .1$
and $\mu < \exp(-1/\sqrt{\epsilon})/\sqrt{\epsilon}$.)


7.4  Dictator  Tests 

In this work we will be considering two-variable linear equation constraints; specifically,
testing functions $f : [q]^n \to \mathbb{Z}_m$ using constraints of the form $f(x) - f(y) = c$, where $c \in \mathbb{Z}$.

Before defining Dictator Tests we need to introduce another small technical detail, that
of testing averages of functions. Given a test $\mathcal{T}$ for functions $f : [q]^n \to \mathbb{Z}_m$, say, we can think
of it more generally as a test for functions $f : [q]^n \to \Delta_m$. To understand this, one should
think of a function with range $\Delta_m$ as a “randomized” function into $\mathbb{Z}_m$. I.e., to apply the
test $\mathcal{T}$ to a function $f : [q]^n \to \Delta_m$, one first chooses a random constraint as usual in $\mathcal{T}$;
say it is $f(x) - f(y) = c$. One then chooses $a \sim f(x)$ and $b \sim f(y)$ (independently) and finally,
one checks the constraint $a - b = c$.

We may now informally state what a Dictator vs. Small Noisy-Influences Test is. It is
a test for functions $f : [q]^n \to \Delta_m$ with the following two properties: (i) Dictator functions
(i.e., functions of the form $f(x) = x_i$) pass the test with high probability. (Here we are
interpreting the integer $x_i \in [q]$ also as an element of $\mathbb{Z}_m$, and thus also as an element
of $\Delta_m$.) In other words, $\mathrm{Val}_{\mathcal{T}}(f)$ is large when $f$ is a dictator. (ii) Functions $f$ satisfying
$\mathrm{Inf}_i^{(1-\eta)}[f] \le \tau$ for all $i \in [n]$ pass the test with low probability, where here $\eta$ and $\tau$ should
be thought of as very small constants. More formally:

Definition 7.4.1. Let $\mathcal{T}$ be a test for functions $f : [q]^n \to \Delta_m$. We say that $\mathcal{T}$ has
completeness at least $c$ if every dictator function $f(x) = x_j$ passes the test with probability at least
$c$. We say that $\mathcal{T}$ has $(\tau,\eta)$-soundness at most $s$ if every function $f : [q]^n \to \Delta_m$ satisfying
$\mathrm{Inf}_i^{(1-\eta)}[f] \le \tau$ for all $i \in [n]$ passes the test with probability at most $s$. Finally, given a
family of tests $(\mathcal{T}_n)$, where $\mathcal{T}_n$ tests functions $f : [q]^n \to \Delta_m$, we say it has soundness $s$ if for every
$\kappa > 0$ there exist $\tau, \eta > 0$ such that each $\mathcal{T}_n$ has $(\tau,\eta)$-soundness at most $s + \kappa$.

We now state our new family of Dictator vs. Small Noisy-Influences Tests. Given
parameters $0 < \epsilon < 1$ and $q \in \mathbb{N}$, we define the following test $\mathcal{T}_{q,\epsilon}$ for functions $f$ with
domain $[q]^L$:

Test $\mathcal{T}_{q,\epsilon}$.

• Choose $x, x' \sim [q]^L$ to be a pair of $(1-\epsilon)$-correlated random strings.

• Choose $c, c' \sim [q/\log(q)]$ independently and uniformly.

• Define $y = x \oplus_q (c, c, \dots, c)$, and define $y' = x' \oplus_q (c', c', \dots, c')$.

• Test the constraint “$f(y) - c = f(y') - c'$” (equivalently, “$f(y) - f(y') = c - c'$”).

As discussed, one can also think of this test as an explicit weighted CSP of Max 2-Lin
type over the variable set $[q]^L$. The constraint $f(y) - c = f(y') - c'$ should be thought
of as a formal expression, since we have not yet specified the range of the assignment $f$.
In fact, we will analyze the test’s properties when the range of $f$ varies over different $\mathbb{Z}_m$’s.


To prove our main Theorem 7.1.2 it will suffice (as we verify in Section 7.5) to show the
following.

Theorem 7.4.2. The Dictator Test $\mathcal{T}_{q,\epsilon}$ uses integer constants $c_{ij}$ in $[-q/\log(q), q/\log(q)]$
and has the following two properties:

Completeness: For each $m \in \mathbb{N} \cup \{\infty\}$, the $L$ dictator functions $f : [q]^L \to \mathbb{Z}_m$ pass the
test $\mathcal{T}_{q,\epsilon}$ with probability at least $1 - \epsilon - O(1/\log(q))$.

Soundness: Assume $0 < \epsilon < .1$ and that $q \ge \exp(1/\sqrt{\epsilon})$ is an integer. Assume $f : [q]^L \to \Delta_m$
satisfies $\mathrm{Inf}_i^{(1-\eta)}[f] \le \tau \le (\log q)^{-(\log q)/c}$ for all $i \in [L]$, where $\eta \le c(\log q)/\log(1/\tau)$ (and $c$ is the
constant from Theorem 7.3.6). Assume further that $q/\log(q) \le m \le \infty$. Then $f$ passes the
test $\mathcal{T}_{q,\epsilon}$ with probability less than

$$O(1/q)^{\epsilon/(2-\epsilon)} + \frac{O(q^{\log q})}{\epsilon} \cdot \frac{\log\log(1/\tau)}{\log(1/\tau)}.$$

The  Completeness  part  of  Theorem  7.4.2  is  easy  to  verify: 

Proof. Suppose $f(x) = x_j$ for some $j \in [L]$. In the test $\mathcal{T}_{q,\epsilon}$ we have $x_j = x'_j$ except with
probability at most $\epsilon$. When this event happens, write $b$ for the common value. We further
have that $b$ is at most $q - \lfloor q/\log(q) \rfloor$ except with probability at most $O(1/\log(q))$. Thus with
probability at least $1 - \epsilon - O(1/\log(q))$ we have both $y_j = b + c$ and $y'_j = b + c'$ as integers
in $[q]$, i.e., the $\oplus_q$ does not cause “wrap-around”. Thus $f(y)$ will equal the integer $b + c$
within $\mathbb{Z}_m$, and similarly $f(y')$ will equal $b + c'$ within $\mathbb{Z}_m$, and the tested constraint will
be satisfied. □
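
The completeness argument can be checked empirically with a small Monte Carlo simulation of the test. This is only a sketch under stated assumptions: "$(1-\epsilon)$-correlated" is taken to mean that each coordinate is kept with probability $1-\epsilon$ and otherwise resampled uniformly, $\log$ is the natural logarithm, and the names `shift`, `run_test` and `pass_rate` are hypothetical.

```python
import math
import random

def shift(a, c, q):
    """a (+)_q c, using [q] = {1, ..., q} as the set of representatives."""
    return (a + c - 1) % q + 1

def run_test(f, q, L, eps, m, rng):
    """One run of the test on f : [q]^L -> Z_m; m=None means integer-valued f."""
    x = [rng.randint(1, q) for _ in range(L)]
    # (1 - eps)-correlated copy: keep each coordinate w.p. 1 - eps, else resample.
    xp = [xi if rng.random() < 1 - eps else rng.randint(1, q) for xi in x]
    t = math.floor(q / math.log(q))
    c, cp = rng.randint(1, t), rng.randint(1, t)
    y = [shift(xi, c, q) for xi in x]
    yp = [shift(xi, cp, q) for xi in xp]
    diff = (f(y) - c) - (f(yp) - cp)
    return diff == 0 if m is None else diff % m == 0

def pass_rate(f, q, L, eps, m, trials=20000, seed=0):
    rng = random.Random(seed)
    return sum(run_test(f, q, L, eps, m, rng) for _ in range(trials)) / trials

dictator = lambda x: x[0]
q, L, eps = 64, 5, 0.05
rate_int = pass_rate(dictator, q, L, eps, m=None)  # constraint checked over Z
rate_mod_q = pass_rate(dictator, q, L, eps, m=q)   # constraint checked modulo q
print(rate_int, rate_mod_q)
```

With the constraint evaluated modulo $q$, wrap-around is harmless and the dictator fails essentially only because of the $\epsilon$-noise. Over the integers there is an additional loss from wrap-around, which shrinks as the range of $c, c'$ is made smaller relative to $q$.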

The next two subsections of the work are devoted to the proof of the Soundness part of
Theorem 7.4.2. In the first subsection we prove a technical lemma bounding the noise
stability of functions $f : [q]^L \to \Delta_m$ which have $\|f_j\|_\infty$ small for each $j \in \mathbb{Z}_m$. In the subsequent
subsection, we complete the proof of the soundness of our test.

7.4.1 Technical lemma


Our soundness analysis relies on the following technical lemma; the crucial aspect of it is
that the upper bound we give on the noise stability does not depend on $m$.

Lemma 7.4.3. Fix $0 < \epsilon < .1$ and let $q \ge \exp(1/\sqrt{\epsilon})$ be an integer. Further, let $L, m \in \mathbb{N}$ and
$0 < \eta < 1$. Assume $g : [q]^L \to \Delta_m$ satisfies $\mathrm{Inf}_i^{(1-\eta)}[g] \le \tau \le (\log q)^{-(\log q)/c}$ for all $i \in [L]$, where
$\eta \le c(\log q)/\log(1/\tau)$ (and $c$ is the constant from Theorem 7.3.6).

Then if $\mathbf{E}_x[g(x)_a] \le \log(q)/q$ for all $a \in [m]$, it follows that

$$\mathbb{S}_{1-\epsilon}[g] \le O(1/q)^{\epsilon/(2-\epsilon)} + \frac{O(q^{\log q})}{\epsilon} \cdot \frac{\log\log(1/\tau)}{\log(1/\tau)}.$$

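Proof. Write $\mu_a = \mathbf{E}_x[g(x)_a]$. We use two different bounds for $\mathbb{S}_{1-\epsilon}[g_a]$ depending on the
magnitude of $\mu_a$. The first bound uses the small noisy-influences of $g_a$ (which are certainly
smaller than those of $g$) and the Majority Is Stablest Theorem (Theorem 7.3.6), yielding

$$\mathbb{S}_{1-\epsilon}[g_a] \le \Gamma_{1-\epsilon}(\mu_a) + e(\tau), \qquad e(\tau) := \frac{\log q}{c\epsilon} \cdot \frac{\log\log(1/\tau)}{\log(1/\tau)}.$$

We may also use Proposition 7.3.7 because $\epsilon < .1$ and $\mu_a \le \log(q)/q \le \exp(-1/\sqrt{\epsilon})/\sqrt{\epsilon}$; thus

$$\mathbb{S}_{1-\epsilon}[g_a] \le \mu_a^{1+\epsilon/(2-\epsilon)} + e(\tau). \qquad (7.2)$$

Our second bound is more useful when $\mu_a$ is extremely small; it only needs the
hypercontractivity theorem (Theorem 7.3.5), and not the small noisy-influences condition. The
theorem gives

$$\mathbb{S}_{1-\epsilon}[g_a] = \|T_{\sqrt{1-\epsilon}}\, g_a\|_2^2 \le \|g_a\|_p^2 = \mathbf{E}[g_a^p]^{2/p} \le \mathbf{E}[g_a]^{2/p} = \mu_a^{2/p},$$

where $p = 1 + (1-\epsilon)^{(2-4/q)/\log(q-1)}$ as in Theorem 7.3.5. One can check that $2/p \ge 1 + \epsilon/(1.9\log q)$
for all $0 < \epsilon < 1$ and $q \ge 3$; hence:

$$\mathbb{S}_{1-\epsilon}[g_a] \le \mu_a^{1+\epsilon/(1.9\log q)}. \qquad (7.3)$$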

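We now put the two bounds together:

$$\mathbb{S}_{1-\epsilon}[g] = \sum_{a \in [m]} \mathbb{S}_{1-\epsilon}[g_a] = \sum_{a:\, \mu_a \ge q^{-\log q}} \mathbb{S}_{1-\epsilon}[g_a] + \sum_{a:\, \mu_a < q^{-\log q}} \mathbb{S}_{1-\epsilon}[g_a]$$

$$\le \sum_{a:\, \mu_a \ge q^{-\log q}} \big(\mu_a^{1+\epsilon/(2-\epsilon)} + e(\tau)\big) + \sum_{a:\, \mu_a < q^{-\log q}} \mu_a^{1+\epsilon/(1.9\log q)} \quad \text{(using (7.2), (7.3))}.$$

Since $g$’s range is $\Delta_m$ we have $\sum_{a \in [m]} \mu_a = 1$. Thus the first sum above is at most

$$q^{\log q} e(\tau) + \sum_{a:\, \mu_a \ge q^{-\log q}} \mu_a \cdot \mu_a^{\epsilon/(2-\epsilon)} \le q^{\log q} e(\tau) + \max_a \mu_a^{\epsilon/(2-\epsilon)} \le q^{\log q} e(\tau) + (\log(q)/q)^{\epsilon/(2-\epsilon)},$$

using the assumed upper bound on $\mu_a$. The second sum above is at most

$$\max_{a:\, \mu_a < q^{-\log q}} \mu_a^{\epsilon/(1.9\log q)} \le \big(q^{-\log q}\big)^{\epsilon/(1.9\log q)} = q^{-\epsilon/1.9}.$$

Thus we conclude

$$\mathbb{S}_{1-\epsilon}[g] \le q^{\log q} e(\tau) + (\log(q)/q)^{\epsilon/(2-\epsilon)} + q^{-\epsilon/1.9} \le O(1/q)^{\epsilon/(2-\epsilon)} + \frac{O(q^{\log q})}{\epsilon} \cdot \frac{\log\log(1/\tau)}{\log(1/\tau)},$$

as claimed. □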
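
The inequality $2/p \ge 1 + \epsilon/(1.9 \log q)$ used in the proof is stated without derivation, so a numerical spot check may be reassuring. The sketch below assumes the exponent $p = 1 + (1-\epsilon)^{(2-4/q)/\log(q-1)}$ with natural logarithms, and checks a finite grid only; it is not a proof, and the function names are hypothetical.

```python
import math

def p_exponent(q: int, eps: float) -> float:
    """p(q, eps) = 1 + (1 - eps)^((2 - 4/q) / log(q - 1)); natural log assumed."""
    return 1 + (1 - eps) ** ((2 - 4 / q) / math.log(q - 1))

def margin(q: int, eps: float) -> float:
    """2/p - (1 + eps / (1.9 log q)); nonnegative iff the inequality holds."""
    return 2 / p_exponent(q, eps) - (1 + eps / (1.9 * math.log(q)))

# Grid: q = 3, ..., 199 and eps = 0.001, 0.002, ..., 0.999.
worst = min(margin(q, k / 1000) for q in range(3, 200) for k in range(1, 1000))
print(worst)
```

On this grid the margin is smallest at $q = 3$ with small $\epsilon$, which suggests that the constant 1.9 is chosen to make the inequality hold with very little slack at $q = 3$.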

7.4.2  Soundness  of  the  test 

This  section  is  devoted  to  the  proof  of  the  Soundness  part  of  Theorem  7.4.2. 

Proof. Given $f$ as in the statement of the theorem, we introduce another randomized
function $g : [q]^L \to \Delta_m$. Specifically, $g(x)$ is defined to be the distribution on $a \in \mathbb{Z}_m$
given by the following experiment:

• Choose $c \sim [q/\log(q)]$ uniformly at random.

• Choose $b$ according to the distribution $f(x \oplus_q (c, c, \dots, c))$.

• Define $a = b - c \in \mathbb{Z}_m$.

Thus in the test $\mathcal{T}_{q,\epsilon}$, once $x$ and $x'$ are chosen the probability that $f$ passes the test is
equal to the probability that independent draws from $g(x)$ and $g(x')$ yield the same value
in $\mathbb{Z}_m$. I.e.,

$$\Pr[f \text{ passes the test}] = \mathbf{E}_{x,x'}\big[\langle g(x), g(x') \rangle\big] = \mathbb{S}_{1-\epsilon}[g].$$

It thus suffices to bound $\mathbb{S}_{1-\epsilon}[g]$.

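Our first task is to show that $g$ has small noisy-influences. Define the operator $S_c$ for
$c \in \mathbb{Z}_q$ as follows: $S_c h(x) = h(x \oplus_q (c, c, \dots, c))$. Define the operator $R_c$ for $c \in \mathbb{Z}_m$ as follows:
$(R_c h(x))_a = h(x)_{a+c}$, where the sum $a + c$ is within $\mathbb{Z}_m$. Hence by definition,

$$g = \mathrm{avg}_{c \in [q/\log(q)]} \{R_c S_c f\}. \qquad (7.4)$$

In particular, for each $i \in [L]$ we have

$$\mathrm{Inf}_i^{(1-\eta)}[g] \le \mathrm{avg}_{c \in [q/\log(q)]} \big\{\mathrm{Inf}_i^{(1-\eta)}[R_c S_c f]\big\}$$

by the convexity of noisy-influences (Proposition 7.3.3). But it’s easy to see that
$\mathrm{Inf}_i^{(1-\eta)}[R_c h] = \mathrm{Inf}_i^{(1-\eta)}[h]$ and $\mathrm{Inf}_i^{(1-\eta)}[S_c h] = \mathrm{Inf}_i^{(1-\eta)}[h]$. Hence we conclude
$\mathrm{Inf}_i^{(1-\eta)}[g] \le \mathrm{Inf}_i^{(1-\eta)}[f] \le \tau$ for all $i \in [L]$.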

We now make the key observation. For $a \in \mathbb{Z}_m$, define $\mu_a = \mathbf{E}_{x \sim [q]^L}[g(x)_a]$. Using the
original definition of $g$ we have

$$\mu_a = \Pr_{x,\ c \sim [q/\log(q)],\ b \sim f(x \oplus_q (c, \dots, c))}[b - c = a] = \mathbf{E}_{x,\ c \sim [q/\log(q)]}\big[f(x \oplus_q (c, c, \dots, c))_{a+c}\big],$$

where the expressions $b - c = a$ and $a + c$ are treated within $\mathbb{Z}_m$. But the joint distribution
of $c$ and $x \oplus_q (c, c, \dots, c)$ is identical to the joint distribution of $c$ and $y$, where $y \sim [q]^L$ is
uniform and independent of $c$. Hence

$$\mu_a = \mathbf{E}_{y,\ c \sim [q/\log(q)]}\big[f(y)_{a+c}\big] \le \max_{y \in [q]^L} \Big\{ \mathbf{E}_{c \sim [q/\log(q)]}\big[f(y)_{a+c}\big] \Big\} \le \log(q)/q \quad \text{for all } a \in \mathbb{Z}_m, \qquad (7.5)$$

since $\sum_b f(y)_b = 1$ and $m \ge q/\log(q)$.

Having established (7.5) and also $\mathrm{Inf}_i^{(1-\eta)}[g] \le \tau$ for all $i$, we may bound $\mathbb{S}_{1-\epsilon}[g]$ and
thus complete the proof using the technical Lemma 7.4.3. (In the case that $m = \infty$ we may
still apply the lemma because $g$’s outputs are nonzero on only finitely many coordinates;
hence we may consider $g$’s range to be a finite-dimensional simplex.) □
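
The key observation (7.5) can be verified exactly on a tiny example. The sketch below builds the means $\mu_a$ of the folded function $g$ for an arbitrary deterministic $f$ and checks that no $\mu_a$ exceeds $1/\lfloor q/\log q \rfloor$, which is the exact form the bound takes when the range of $c$ is rounded down to an integer. It assumes natural logarithms and represents $\mathbb{Z}_m$ by $\{0, \dots, m-1\}$ for convenience; the function `folded_means` and the random choice of $f$ are hypothetical.

```python
from fractions import Fraction
from itertools import product
import math
import random

def shift(x, c, q):
    """x (+)_q (c, ..., c), with [q] = {1, ..., q}."""
    return tuple((xi + c - 1) % q + 1 for xi in x)

def folded_means(f, q, L, m):
    """mu_a = E_x[g(x)_a] for g = avg_c R_c S_c f, where f : [q]^L -> Z_m."""
    t = math.floor(q / math.log(q))
    pts = list(product(range(1, q + 1), repeat=L))
    mu = [Fraction(0)] * m
    for x in pts:
        for c in range(1, t + 1):
            a = (f[shift(x, c, q)] - c) % m  # a = b - c within Z_m
            mu[a] += Fraction(1, len(pts) * t)
    return t, mu

q, L, m = 8, 2, 5
rng = random.Random(1)
# An arbitrary deterministic function f : [q]^L -> Z_m (hypothetical choice).
f = {x: rng.randrange(m) for x in product(range(1, q + 1), repeat=L)}
t, mu = folded_means(f, q, L, m)
print(t, mu)
```

The bound holds for every $f$ because, for fixed $y$, the values $a + c$ for the different $c$ are distinct modulo $m$ whenever $m$ is at least the number of shifts, so at most one of them can equal $f(y)$.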


7.5 The Reduction from Unique-Games$_L$

In this section we show how to use our Dictator Test to obtain our main UG-hardness
result, Theorem 7.1.2. We reiterate that we are essentially using the reduction implicitly
proved in [99]; we give the full deduction here for completeness and because we are
working in a slightly nonstandard setting.

For technical convenience, we will use the following equivalent version of the UGC due
to Khot and Regev [94, Lemma 3.6]:

Theorem 7.5.1. Assume the UGC. For all small $\zeta, \gamma > 0$, there exists $L \in \mathbb{N}$ such that given an
unweighted Unique-Games$_L$ instance $\mathcal{G} = (U, V, E, (\pi_{u,v})_{(u,v) \in E})$ which is $U$-regular, it is
NP-hard to distinguish the following two cases:

1. There is an assignment $A : (U \cup V) \to [L]$ and a subset $U' \subseteq U$ with $|U'|/|U| \ge 1 - \zeta$ such
that $A$ satisfies all constraints incident on $U'$.

2. There is no $\gamma$-good assignment $A$.

Our main task, which we will carry out in the next subsection, will be to prove the
following slight variant of Theorem 7.1.2, wherein we write $s(q,\epsilon) = O(1/q)^{\epsilon/(2-\epsilon)}$ for the
main term in the Soundness part of Theorem 7.4.2:

Theorem 7.5.2. Fix $0 < \epsilon < .1$ rational and $q \ge \exp(1/\sqrt{\epsilon})$ an integer. For any $L \in \mathbb{N}$, there
is a polynomial-time reduction mapping non-bipartite, unweighted Unique-Games$_L$
instances $\mathcal{G}$ into Max 2-Lin instances $\mathcal{I}$ having the following properties:

• (Completeness.) If statement 1 in Theorem 7.5.1 holds for $\mathcal{G}$, then there is an integer
assignment to the variables in $\mathcal{I}$ satisfying at least $(1-\zeta)(1 - \epsilon - O(1/\log(q)))$-weight
of the equations.

• (Soundness.) If there is no $\gamma$-good assignment for $\mathcal{G}$ where $\gamma = \gamma(q,\epsilon) > 0$ is sufficiently
small, then there is no integer assignment to the variables in $\mathcal{I}$ which satisfies at least
$(3s(q,\epsilon))$-weight of the equations modulo $m$, for any integer $m \ge q/\log(q)$.

By combining Theorem 7.5.2 with Theorem 7.5.1, taking $\zeta = 1/\log(q)$ and $\gamma = \gamma(q,\epsilon) > 0$
as necessary, we obtain the following variant of Theorem 7.1.2:

Theorem 7.5.3. Assume the UGC. For any $0 < \epsilon < .1$ rational and $q \ge \exp(1/\sqrt{\epsilon})$ an integer,
the following holds: Given an instance of Max 2-Lin in which the integer constants $c_{ij}$
are in the range $[-q/\log(q), q/\log(q)]$, it is NP-hard to distinguish the following two cases:

• There is a $(1 - \epsilon - O(1/\log(q)))$-good integer assignment to the variables.

• There is no assignment to the variables which is $O(1/q)^{\epsilon/(2-\epsilon)}$-good modulo any integer
$m \ge q/\log(q)$.

From this, we can deduce our main Theorem 7.1.2 for $\epsilon'$ by taking $\epsilon$ in
Theorem 7.5.3 to be a rational of the form $\epsilon' - O(1/\log(q))$.

7.5.1 Proof of Theorem 7.5.2

We now prove Theorem 7.5.2.

Proof. The reduction is essentially as in [99]. Given the Unique-Games$_L$ instance
$\mathcal{G} = (U, V, E, (\pi_{uv}))$, the reduction produces a weighted Max 2-Lin instance $\mathcal{I}$ with variable
set $V \times [q]^L$. We think of an assignment $F$ to these variables as a collection of functions
$f_v : [q]^L \to \mathbb{Z}_m$, one for each $v \in V$. Here we will allow $q/\log(q) \le m \le \infty$. For each $u \in U$ we
also introduce the randomized function $f_u : [q]^L \to \Delta_m$ defined by

$$f_u(x) = \mathbf{E}_{v:\,(u,v) \in E}\big[f_v^{\pi_{uv}}(x)\big],$$

where we define the functions $f_v^{\pi} : [q]^L \to \mathbb{Z}_m$ by

$$f_v^{\pi}(x) = f_v(x \circ \pi^{-1}), \quad \text{with } x \circ \pi^{-1} \in [q]^L \text{ defined by } (x \circ \pi^{-1})_j = x_{\pi^{-1}(j)}.$$

We now define the instance $\mathcal{I}$ according to the following probabilistic test:

• Choose $u \in U$ randomly.

• Apply test $\mathcal{T}_{q,\epsilon}$ from Section 7.4 to $f_u$.

Note that by the definition of applying a test to a randomized function, this indeed makes
$\mathcal{I}$ a weighted Max 2-Lin instance over the variables $V \times [q]^L$. Further, it is easy to check
that the reduction from $\mathcal{G}$ to $\mathcal{I}$ thus defined can be carried out in polynomial time assuming
$\epsilon$, $q$, and $L$ are constant.

To prove the Completeness part of Theorem 7.5.2, suppose that assignment $A$ and subset
$U' \subseteq U$ are as in statement 1 of Theorem 7.5.1. Define an integer-valued assignment $F$
for $\mathcal{I}$ by taking $f_v(x) = x_{A(v)}$. Then by definition and by the property of $A$, we will have that
$f_u : [q]^L \to \Delta_\infty$ is in fact the $A(u)$th dictator function for all $u \in U'$. Thus by the completeness
part of Theorem 7.4.2, assignment $F$ will pass the test with probability at least
$\Pr[u \in U'] \cdot (1 - \epsilon - O(1/\log(q))) \ge (1-\zeta)(1 - \epsilon - O(1/\log(q)))$. This finishes the Completeness
part of Theorem 7.5.2.
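
The step "$f_u$ is in fact the $A(u)$th dictator" rests on how dictators behave under the permutation $x \mapsto x \circ \pi^{-1}$. The following sketch checks this on hypothetical toy data with 0-indexed labels: if the edge constraint $\pi_{uv}(A(u)) = A(v)$ holds and $f_v$ is the $A(v)$th dictator, then $f_v^{\pi_{uv}}$ is the $A(u)$th dictator.

```python
def compose_inv(x, pi):
    """Return x o pi^{-1}: its j-th coordinate is x[pi^{-1}(j)] (0-indexed)."""
    inv = [0] * len(pi)
    for i, j in enumerate(pi):
        inv[j] = i
    return [x[inv[j]] for j in range(len(pi))]

def permuted(f_v, pi):
    """f_v^pi(x) = f_v(x o pi^{-1})."""
    return lambda x: f_v(compose_inv(x, pi))

# Hypothetical toy data: L = 4 labels and one edge (u, v) with permutation pi.
pi = [2, 0, 3, 1]            # pi maps label i of u to label pi[i] of v
label_u = 1                  # A(u)
label_v = pi[label_u]        # A(v); the edge constraint pi(A(u)) = A(v) holds
f_v = lambda x: x[label_v]   # the A(v)-th dictator

x = [10, 20, 30, 40]
print(permuted(f_v, pi)(x), x[label_u])
```

Since every neighbor $v$ of a vertex $u \in U'$ contributes the same dictator after permutation, their average $f_u$ is that dictator as well.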

As for the Soundness part of Theorem 7.5.2, choose $\tau = \tau(q,\epsilon) > 0$ small enough so that
the error term in the Soundness part of Theorem 7.4.2 is at most the main term, $s(q,\epsilon)$;
choose also $\eta = \eta(q,\epsilon) > 0$ sufficiently small so that the hypothesis therein holds. By way of
proving the contrapositive, suppose that there is an integer $m \ge q/\log(q)$ and a $\mathbb{Z}_m$-valued
assignment $F$ to $\mathcal{I}$ which passes the test with probability at least $3s(q,\epsilon)$. Then by an
averaging argument, there must be some subset $U' \subseteq U$ of fractional size at least $s(q,\epsilon)$
such that when $u \in U'$, the test $\mathcal{T}_{q,\epsilon}$ passes $f_u$ with probability at least $2s(q,\epsilon)$. It follows
from the Soundness part of Theorem 7.4.2, along with our choice of $\tau$ and $\eta$, that

$$\text{for all } u \in U',\ \exists\, i_u \in [L] \text{ s.t. } \mathrm{Inf}_{i_u}^{(1-\eta)}[f_u] > \tau. \qquad (7.6)$$

By definition of $f_u$ and by the convexity of noisy-influences (Proposition 7.3.3) we deduce
that for each such $u \in U'$ and $i_u \in [L]$,

$$\tau < \mathrm{avg}_{v:\,(u,v) \in E} \Big\{\mathrm{Inf}_{i_u}^{(1-\eta)}\big[f_v^{\pi_{uv}}\big]\Big\} = \mathrm{avg}_{v:\,(u,v) \in E} \Big\{\mathrm{Inf}_{\pi_{uv}(i_u)}^{(1-\eta)}[f_v]\Big\}.$$

Since noisy-influences are at most 1 here, an averaging argument gives

$$\tau/2 \le \mathrm{Inf}_{\pi_{uv}(i_u)}^{(1-\eta)}[f_v] \text{ for at least a } \tau/2\text{-fraction of } u\text{’s neighbors } v. \qquad (7.7)$$

For each $v \in V$ let us define

$$C(v) = \{ j \in [L] : \mathrm{Inf}_j^{(1-\eta)}[f_v] \ge \tau/2 \};$$

thus by (7.7) we have:

$$\forall u \in U',\ \pi_{uv}(i_u) \in C(v) \text{ for at least a } \tau/2\text{-fraction of } u\text{’s neighbors } v \in V. \qquad (7.8)$$

We claim and will show shortly that $|C(v)| \le 1/(\eta\tau)$ for all $v$. Having established this,
consider choosing a random assignment $A : (U \cup V) \to [L]$ as follows: for $u \in U'$ set $A(u) = i_u$;
for $v \in V$, choose $A(v)$ randomly from $C(v)$ (assuming the set is nonempty); finally,
set $A(w)$ arbitrarily in $[L]$ for all unassigned vertices $w$. Now by (7.8), for each $u \in U'$ the
expected fraction of constraints incident on $u$ which $A$ satisfies is at least $(\tau/2) \cdot (\eta\tau) = \eta\tau^2/2$.
Since $|U'|/|U| \ge s(q,\epsilon)$ and $\mathcal{G}$ is $U$-regular, we conclude that the expected fraction of all
constraints in $\mathcal{G}$ that $A$ satisfies is at least $s(q,\epsilon)\eta\tau^2/2$. Taking $\gamma = \gamma(q,\epsilon) = s(q,\epsilon)\eta\tau^2/2$, we
conclude that there must exist a $\gamma$-good assignment for $\mathcal{G}$.

It remains to verify the claim that $|C(v)| \le 1/(\eta\tau)$ for all $v$. This is true by Fact 7.3.4.
We also need the following small observation: even for arbitrarily large $m$, if $h : [q]^L \to \Delta_m$
then $\mathrm{Var}[h] \le 1$. This is because $\mathrm{Var}[h] \le \mathbf{E}[\|h\|^2] \le \max_x \{\|h(x)\|^2\} \le 1$, as every point in
$\Delta_m$ has Euclidean norm at most 1. Thus

$$|C(v)| = \big|\{ j \in [L] : \mathrm{Inf}_j^{(1-\eta)}[f_v] \ge \tau/2 \}\big| \le \frac{1/(2e\eta)}{\tau/2} \le 1/(\eta\tau),$$


as  claimed. 


□

Chapter 8

On Hardness of Vertex Pricing

8.1 Introduction


We study the item pricing problem, which is a CSP whose constraints are generalized
payoff functions.

8.1.1 Motivation and Background

An informal description of the problem is as follows: a seller has an infinite supply of $n$
different items. There are $m$ buyers, each of whom is interested in a subset of the items
and has a certain budget limit. These buyers are all single-minded; i.e., they either buy all the
items they are interested in, if the overall cost is within their budget, or they buy none
of them. The algorithmic task is to price each item $i$ with a profit margin $p_i$ so as to maximize
the overall profit of the seller.

Several results are known when the profit margin $p_i$ on each item is required to be
positive. An $O(\log n + \log m)$-approximation for the general problem was given by Guruswami
et al. [64]. If we assume that each customer is only interested in a constant number $k$
of the items, an $O(k^2)$-approximation algorithm was given by Briest and Krysta [26].
Later, Balcan and Blum [16] improved the approximation ratio to $O(k)$. In particular,
when $k = 2$ (this case is also called graph vertex pricing), their algorithm gives a
4-approximation. On the hardness side, an APX-hardness result was obtained for the
general problem in [64]. Later, Demaine, Feige, Hajiaghayi, and Salavatipour obtained a
poly-logarithmic hardness result [38]. For the case in which each customer is interested in at
most 2 of the items, a 2-hardness result was obtained in [93] assuming the Unique Games
Conjecture (UGC).

Much less is known when the seller is allowed to assign a negative profit margin $p_i$ to
some of the items. The motivation behind selling some items below their margin cost is to
increase the overall profit by stimulating the sales of other products. Items sold
below cost are usually referred to as “loss leaders”. One example of a loss leader arises in
the market for digital book readers (such as the Kindle and iPad): the seller may price the
reading device low so as to make more money on the sales of digital books.

The study of pricing loss leaders was formulated as an open problem in [16];
the authors asked: "what kind of approximation guarantees are achievable if one allows
the seller to price some items below their margin cost?" Interestingly, the authors found
that by optimally pricing some of the items below cost, one can possibly achieve a profit
that is $\Omega(\log n)$ times the maximum profit under the positive price model. The problem
of pricing loss leaders was further studied by Balcan et al. in [56]. They introduced two
new models: the coupon model and the discount model. Roughly speaking, the discount model is
the item pricing problem with negative profit margins allowed; the coupon model adds the
additional assumption that the seller’s profit is at least 0 on the entire transaction with
each customer. The same $\Omega(\log n)$ “profitability gap” was shown under these models.

In this work, we give a negative result for pricing loss leaders. In particular, we show
that obtaining a constant-factor approximation for item pricing, under either the coupon or the discount
model, is NP-hard assuming the Unique Games Conjecture; our hardness result holds
even for the very simple case in which each customer is interested in at most $k = 3$ items.
Our result should be compared with the case in which only positive prices are assigned, where there
is an $O(k)$-approximation for the problem.

8.1.2  Problem  definitions 

The item pricing problem is also called the VERTEX-PRICING problem; it can be defined
on a graph in which each customer corresponds to a hyperedge and each item to be priced
corresponds to a vertex. Let us start by formally defining the VERTEX-PRICING
problem.

Definition 8.1.1. (VERTEX-PRICING) A vertex pricing problem is specified by the tuple

$$\mathcal{I} = \bigl(G(V,E), \{b_e \mid e \in E\}\bigr).$$

Here $G(V,E)$ is a multigraph where each vertex $v_i \in V$ represents an item. Each hyperedge
$e \in E$ represents a set of items (vertices) that a particular customer is interested in, with
budget $b_e$.

When the corresponding graph is a $k$-hypergraph (i.e., each customer is interested in at
most $k$ items), we call the problem VERTEX-PRICING$_k$.

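Definition 8.1.2. Given a VERTEX-PRICING instance $\mathcal{I}$ and a price function $p : V \to \mathbb{R}$,
the profit is defined as follows:

$$\mathrm{profit}_{\mathcal{I}}(p) = \sum_{e \in E :\, b_e \ge \mathrm{price}(e)} \mathrm{price}(e),$$

where $\mathrm{price}(e) = \sum_{v \in e} p(v)$.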

When we restrict the range of the price function $p$, we get the positive price model, as
well as the discount model and the $B$-bounded model introduced in [56].

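Definition 8.1.3. Given an instance $\mathcal{I}$ of VERTEX-PRICING:

For the positive price model, the objective function is

$$\mathrm{Opt}_{pos} = \max_{p : V \to \mathbb{R}^{+}} \mathrm{profit}_{\mathcal{I}}(p).$$

For the discount model, the objective function is

$$\mathrm{Opt}_{disc} = \max_{p : V \to \mathbb{R}} \mathrm{profit}_{\mathcal{I}}(p).$$

For the $B$-bounded model, the objective function is

$$\mathrm{Opt}_{B} = \max_{p : V \to [-B, \infty)} \mathrm{profit}_{\mathcal{I}}(p).$$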

The $B$-bounded model applies to the case in which each item has the same margin cost
$B$, so that the seller cannot set a profit margin below $-B$. The authors of [56] also
defined the coupon model, which assumes that the profit is at least 0 on each sale to a
customer.

Definition 8.1.4. Given an instance $\mathcal{I}$ of VERTEX-PRICING, the profit under the coupon model
is defined as

$$\mathrm{profit}^{+}_{\mathcal{I}}(p) = \sum_{e \in E :\, b_e \ge \mathrm{price}(e)} \max(\mathrm{price}(e), 0)$$

and the objective function is the following:

$$\mathrm{Opt}_{coup} = \max_{p : V \to \mathbb{R}} \mathrm{profit}^{+}_{\mathcal{I}}(p).$$

It is easy to see the following relationship among these models.

Fact 8.1.5. For any $B > 0$ and a VERTEX-PRICING instance $\mathcal{I}$,

$$\mathrm{Opt}_{pos} \le \mathrm{Opt}_{B} \le \mathrm{Opt}_{disc} \le \mathrm{Opt}_{coup}.$$
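
As an illustration of Definitions 8.1.2 and 8.1.4, the following minimal Python sketch evaluates the discount-model profit and the coupon-model profit of a fixed price function. The encoding of a customer as an (items, budget) pair, the function names, and the toy instance are all assumptions made here for illustration; they do not come from the text.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding (not from the text): a customer is (items, budget).
Customer = Tuple[Tuple[str, ...], float]


def price(items: Tuple[str, ...], p: Dict[str, float]) -> float:
    """price(e): total profit margin of the items in one customer's bundle."""
    return sum(p[v] for v in items)


def profit(customers: List[Customer], p: Dict[str, float]) -> float:
    """Discount-model profit: add price(e) for every customer with b_e >= price(e)."""
    return sum(price(e, p) for e, b in customers if b >= price(e, p))


def profit_plus(customers: List[Customer], p: Dict[str, float]) -> float:
    """Coupon-model profit: the same sum, with each sale's contribution floored at 0."""
    return sum(max(price(e, p), 0.0) for e, b in customers if b >= price(e, p))


# Toy instance with made-up items and budgets.
customers = [(("reader", "book"), 10.0), (("book",), 8.0), (("reader",), 1.0)]
positive_prices = {"reader": 2.0, "book": 8.0}
below_cost = {"reader": -5.0, "book": 8.0}

print(profit(customers, positive_prices))   # profit with positive margins only
print(profit(customers, below_cost))        # a sale at a loss is counted negatively
print(profit_plus(customers, below_cost))   # the same sale contributes 0 instead
```

For any fixed price function the coupon-model profit is at least the discount-model profit, since each term is only floored at 0; this pointwise comparison is what gives the last inequality of Fact 8.1.5.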


Weighted vs. unweighted instances. We can also define a weighted version of the above
vertex pricing problem. The difference is that every edge $e$ has a weight $w_e$ and $\mathrm{profit}(p)$ is
defined to be

$$\sum_{e \in E :\, b_e \ge \mathrm{price}(e)} w_e \cdot \mathrm{price}(e).$$

A similar change is made to $\mathrm{profit}^{+}(p)$.

As is shown in [93] (Lemma 2.2), unweighted VERTEX-PRICING has the same
approximability as weighted VERTEX-PRICING.¹ In the rest of the thesis, we only prove
the hardness results for weighted VERTEX-PRICING; the same hardness results also
hold for unweighted VERTEX-PRICING.


8.1.3  Main  result 

Our  main  result  is  the  following  theorem: 

Theorem 8.1.6. Assume the UGC. Then for any positive integer $B$, given a VERTEX-PRICING$_3$
instance it is NP-hard to distinguish the following two cases:

• $\mathrm{Opt}_{B} \ge \Omega(\log B)$;

• $\mathrm{Opt}_{coup} \le 25$.

Using Fact 8.1.5 and taking $B = 2^{\Omega(\alpha)}$, we get the following corollaries:

Corollary 8.1.7. Assuming the Unique Games Conjecture, for any constant $\alpha > 0$,
VERTEX-PRICING$_3$ under the coupon model is NP-hard to $\alpha$-approximate.

Corollary 8.1.8. Assuming the Unique Games Conjecture, for any constant $\alpha > 0$,
VERTEX-PRICING$_3$ under the discount model is NP-hard to $\alpha$-approximate.

Corollary 8.1.9. Assuming the Unique Games Conjecture, VERTEX-PRICING$_3$ under the
$B$-bounded model is NP-hard to $O(\log B)$-approximate.


8.2  Preliminaries 

8.2.1  Dictator  Test  for  vertex  pricing 

VERTEX-PRICING$_3$ as a 3-CSP. The VERTEX-PRICING$_3$ problem can be viewed as a 3-CSP
over a set of variables $p_1, p_2, \ldots, p_n$ and a set of constraints specified by $b_{ijk}$ with weight
$w_{ijk}$.² Let us first consider the VERTEX-PRICING problem under the discount model; the
payoff function on $b_{ijk}$ is

$$\mathrm{revenue}(p_i, p_j, p_k, b_{ijk}) = \mathbf{1}(p_i + p_j + p_k \le b_{ijk}) \cdot (p_i + p_j + p_k).$$

The goal is to find $p : [n] \to \mathbb{R}$ to maximize the overall profit:

$$\sum_{i,j,k} w_{ijk} \cdot \mathrm{revenue}(p_i, p_j, p_k, b_{ijk}).$$

¹ Although the original proof only applies to VERTEX-PRICING$_2$ with positive prices, it is straightforward
to adapt the proof to our problem: VERTEX-PRICING$_3$ with arbitrary prices.

By the rule of thumb, we need to design a Dictator Test of the following form. It is a
test $\mathcal{T}$ for functions $f : [p]^n \to \mathbb{R}$ with the following two properties: (i) Dictator functions,
i.e., functions of the form $h(x_i)$ for a particular function $h : [p] \to \mathbb{R}$ and each $i \in [n]$, pass
the test with high profit, $\mathrm{profit}_{\mathcal{T}}(f) = c$;³ (ii) functions $f$ that have “low noisy influence” on each
coordinate pass the test with low profit, $\mathrm{profit}_{\mathcal{T}}(f) = s$. Then, roughly speaking, by the
technique of [99], we can show that assuming the UGC, it is NP-hard to distinguish whether a
VERTEX-PRICING$_3$ instance has profit above $c$ or below $s$ (which directly implies a
hardness of approximation ratio $s/c$).

The above is the description of the Dictator Test for the discount model. For the coupon
model, the Dictator Test is essentially the same, except that the payoff function is defined as

$$\mathrm{revenue}^{+}(p_i, p_j, p_k, b_{ijk}) = \mathbf{1}(p_i + p_j + p_k \le b_{ijk}) \cdot \max(p_i + p_j + p_k, 0)$$

and the profit of a function $f$ is defined as

$$\mathrm{profit}^{+}_{\mathcal{T}}(f) = \mathbb{E}_{x,y,z,b}\bigl[\mathrm{revenue}^{+}(f(x), f(y), f(z), b)\bigr].$$

In the rest of the work, we first design and analyze a proper Dictator Test for
VERTEX-PRICING$_3$. Then we use the idea from [99] to construct a reduction from the
UNIQUE-GAMES problem to the VERTEX-PRICING problem. We want to emphasize here that we
cannot directly use [99], as the variables in VERTEX-PRICING are unbounded. To circumvent
that, we need to modify the definition of a “low noisy influence” function correspondingly.

8.2.2  Mathematical  tools 

One major tool we need in our analysis is the following theorem, which is
essentially similar to the invariance principle stated in Chapter 2.

Theorem 8.2.1. Let $(\Omega = [p]^t, \mu)$ be a finite probability space with the following properties:

• $a = (a_1, a_2, \ldots, a_t) \sim \mu$ are pairwise independent.

• $\alpha = \min_{a \in \Omega} \mu(a) > 0$.

For $\eta > 0$, let $f = (f^{(1)}, \ldots, f^{(t)}) : \Omega^n \to [0,1]^t$ be a function satisfying, for any $i \in [n]$, $j \in [t]$
and some constant $\tau > 0$,

$$\mathrm{Inf}_i^{1-\eta}\bigl[f^{(j)}\bigr] \le \tau.$$

Then

$$\Bigl|\, \mathbb{E}\Bigl[\prod_{i=1}^{t} T_{1-\eta} f^{(i)}\Bigr] - \prod_{i=1}^{t} \mathbb{E}\bigl[f^{(i)}\bigr] \Bigr| \le \tau^{C_0 \eta / \log(1/\alpha)}.$$

Here $C_0$ is a constant that only depends on $t$. The expectation is taken with respect to the
product distribution $(\Omega, \mu)^n$.

² Strictly speaking, for each $(i,j,k)$, there can be different $b_{ijk}$ with different weights $w_{ijk}$.

³ Usually $h(t) = t$ for most of the other results in the thesis.

Roughly speaking, the above theorem states that, when computing the expectation of a product of $t$
functions, if none of these functions has a large noisy influence on any coordinate, then the
result is essentially the same under a pairwise independent distribution as under
the fully independent distribution.


8.3  Dictator  Test  for  vertex  pricing 

8.3.1  Description  of  the  Dictator  Test 

To introduce and analyze our Dictator Test, let us first define the following
distributions $\mathcal{D}_0, \mathcal{D}_1, \mathcal{D}_2$ on $(x,y,z) \in [p]^n \times [p]^n \times [p]^n$.

Definition 8.3.1. (Distribution $\mathcal{D}_0$) Choose $x, y$ uniformly at random and independently from
$[p]^n$; for each $i$, we have that

• $z_i = p - (x_i + y_i)$ if $x_i + y_i \le p$.

• $z_i = 2p - (x_i + y_i)$ if $p < x_i + y_i < 2p$.

By definition, we know that $x_i + y_i + z_i = 0 \bmod p$ for each $i$. One important property
of the above distribution is that $x_i, y_i, z_i$ are pairwise independent for each $i$.
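
The pairwise independence property can be checked by exhaustive enumeration for a small $p$. The sketch below is only an illustration under stated assumptions: it takes $[p] = \{0, \ldots, p-1\}$ and uses the modular form $z_i = -(x_i + y_i) \bmod p$ in place of the two-case rule; the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def d0_coordinate(p):
    """Joint law of one coordinate (x_i, y_i, z_i), assuming z_i = -(x_i + y_i) mod p."""
    w = Fraction(1, p * p)
    return {(x, y, (-(x + y)) % p): w for x, y in product(range(p), repeat=2)}


def is_pairwise_uniform(dist, p):
    """True iff every pair among the three coordinates is uniform on [p] x [p]."""
    for i, j in ((0, 1), (0, 2), (1, 2)):
        marginal = Counter()
        for triple, w in dist.items():
            marginal[(triple[i], triple[j])] += w
        if len(marginal) != p * p:
            return False
        if any(v != Fraction(1, p * p) for v in marginal.values()):
            return False
    return True


dist = d0_coordinate(5)
print(is_pairwise_uniform(dist, 5))   # each pair of coordinates is uniform
print(len(dist), 5 ** 3)              # the support has p^2 points, not p^3
```

The support has only $p^2$ triples, so the three coordinates are far from mutually independent even though every pair is uniform.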

Definition 8.3.2. (Distribution $\mathcal{D}_1$) For $x,y,z \sim \mathcal{D}_0$, let $x',y',z'$ be $(1-\epsilon)$-correlated with
$x,y,z$. We call the corresponding distribution on $x',y',z'$ $\mathcal{D}_1$.

Definition 8.3.3. (Distribution $\mathcal{D}_2$) Choose $x,y,z$ uniformly at random and independently
from $[p]^n$.

The following is the Dictator Test for vertex pricing. Here, we use $\mathbf{1}$ to denote the all-“1”
vector $(1,1,\ldots,1) \in \mathbb{R}^n$.

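Definition 8.3.4. (Dictator Test $\mathcal{T}$) For $x',y',z'$ generated from $\mathcal{D}_1$ and $k$ chosen at random
from $[\sqrt{p}]$, we generate a VERTEX-PRICING constraint among $f(x'), f(y'), f(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1})$
with budget $\lfloor\sqrt{p}/k\rfloor$. We define

$$\mathrm{profit}_{\mathcal{T}}(f) = \mathbb{E}_{x',y',z',k}\bigl[\mathrm{revenue}\bigl(f(x'), f(y'), f(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}), \lfloor\sqrt{p}/k\rfloor\bigr)\bigr]$$

and

$$\mathrm{profit}^{+}_{\mathcal{T}}(f) = \mathbb{E}_{x',y',z',k}\bigl[\mathrm{revenue}^{+}\bigl(f(x'), f(y'), f(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}), \lfloor\sqrt{p}/k\rfloor\bigr)\bigr].$$

For the purpose of analyzing $\mathcal{T}$, we also define the following Test $\mathcal{T}'$.

Definition 8.3.5. (Test $\mathcal{T}'$) For $x',y',z'$ generated from $\mathcal{D}_2$, randomly choose $k \in [\sqrt{p}]$. We
generate a VERTEX-PRICING constraint among $f(x'), f(y'), f(z')$ with budget $\lfloor\sqrt{p}/k\rfloor$.

We define

$$\mathrm{profit}_{\mathcal{T}'}(f) = \mathbb{E}_{x',y',z',k}\bigl[\mathrm{revenue}\bigl(f(x'), f(y'), f(z'), \lfloor\sqrt{p}/k\rfloor\bigr)\bigr]$$

and

$$\mathrm{profit}^{+}_{\mathcal{T}'}(f) = \mathbb{E}_{x',y',z',k}\bigl[\mathrm{revenue}^{+}\bigl(f(x'), f(y'), f(z'), \lfloor\sqrt{p}/k\rfloor\bigr)\bigr].$$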

We claim that the test $\mathcal{T}'$ has the following property:

Proposition 8.3.6. For any function $f : [p]^n \to \mathbb{R}$, $\mathrm{profit}^{+}_{\mathcal{T}'}(f) \le 1$.

Proof. Notice that for each triple $(x',y',z')$, if there exists $k'$ such that
$\lfloor\sqrt{p}/(k'+1)\rfloor < f(x') + f(y') + f(z') \le \lfloor\sqrt{p}/k'\rfloor$, then the constraint pays off only when $k \le k'$,
which happens with probability at most $k'/\sqrt{p}$. The expected profit on $(x',y',z')$ is therefore at most

$$\frac{k'}{\sqrt{p}} \bigl(f(x') + f(y') + f(z')\bigr) \le \frac{k'}{\sqrt{p}} \cdot \frac{\sqrt{p}}{k'} \le 1.$$

If $f(x') + f(y') + f(z') \le 0$ or $f(x') + f(y') + f(z') > \sqrt{p}$, then the profit on $(x',y',z')$ is 0.
Conditioned on every triple $(x',y',z')$, the expected profit associated with $f(x'), f(y'), f(z')$ is
at most 1; therefore the overall profit is also at most 1. □
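
The counting argument in this proof can be sanity-checked numerically. The following sketch fixes the value $s = f(x') + f(y') + f(z')$ of one triple and averages the coupon-model revenue over $k$; it assumes that $\sqrt{p}$ is an integer $m$ (the parameter `m` below), and the function names and the grid of test values are choices made here, not taken from the text.

```python
def expected_coupon_revenue(s, m):
    """E_k[ 1(s <= floor(m / k)) * max(s, 0) ] for k uniform in {1, ..., m}.

    Here m stands in for sqrt(p) and s for the sum f(x') + f(y') + f(z').
    """
    hits = sum(1 for k in range(1, m + 1) if s <= m // k)
    return hits * max(s, 0.0) / m


def worst_case(m, steps_per_unit=4):
    """Largest expected revenue over a grid of sums s in [-2, m + 2]."""
    lo, hi = -2 * steps_per_unit, (m + 2) * steps_per_unit
    grid = [i / steps_per_unit for i in range(lo, hi + 1)]
    return max(expected_coupon_revenue(s, m) for s in grid)


for m in (4, 9, 16):
    print(m, worst_case(m))
```

On such a grid the maximum is attained, for instance, at $s = 1$, where every budget $\lfloor m/k \rfloor$ is large enough and the revenue is exactly 1.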


8.3.2 Analysis of the Dictator Test $\mathcal{T}$

We prove the completeness (Theorem 8.3.7) and soundness (Theorem 8.3.8) of $\mathcal{T}$ in
this section.

Theorem 8.3.7. (Completeness of $\mathcal{T}$) For the function $f(x) = x_i - p/3$, $x \in [p]^n$, we have
$\mathrm{profit}_{\mathcal{T}}(f) \ge \Omega(\log p)$.

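Proof. Suppose $x',y',z' \sim \mathcal{D}_1$ is generated as a $(1-\epsilon)$-correlated copy of $x,y,z \sim \mathcal{D}_0$.

Since $x_i, y_i$ are randomly generated from $[p]$, we know that $\sqrt{p} \le x_i + y_i \le p$ with
probability at least 1/3. When this happens, $x_i + y_i + z_i = p$ and $z_i \le p - \sqrt{p}$. Also, as each of
$x_i, y_i, z_i$ is reset to a random number with probability $\epsilon = 1/p$, we know that with
probability at least $1/3 - 3/p$ we have $x'_i = x_i$, $y'_i = y_i$, $z'_i = z_i$, and hence
$x'_i + y'_i + z'_i = p$ and $z'_i \le p - \sqrt{p}$. We call these $(x',y',z')$ “good”.

Then for “good” $(x',y',z')$, if we choose $f(x) = x_i - p/3$, we know that
$f(x') + f(y') + f(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}) = x_i + y_i + z_i + \lfloor\sqrt{p}/k\rfloor - p = \lfloor\sqrt{p}/k\rfloor$. Therefore,

$$\mathrm{revenue}\bigl(f(x'), f(y'), f(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}), \lfloor\sqrt{p}/k\rfloor\bigr) = \lfloor\sqrt{p}/k\rfloor.$$

Therefore the profit contributed by the “good” $(x',y',z')$ is at least

$$(1/3 - 3/p) \sum_{k=1}^{\sqrt{p}} \frac{\lfloor\sqrt{p}/k\rfloor}{\sqrt{p}} \ge (1/3 - 3/p) \sum_{k=1}^{\sqrt{p}} \frac{\sqrt{p}/k - 1}{\sqrt{p}} \ge (1/3 - 3/p)(\log\sqrt{p} - 1) \ge \frac{1}{8}\log p$$

for large enough $p$.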

We also need to bound the profit (loss) on those "bad" $x',y',z'$ such that for some $k$

$$f(x') + f(y') + f(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}) < 0.$$

This could happen for $x',y',z'$ generated in the following two cases:

1. At least one of $x'_i, y'_i, z'_i$ is reset; this happens with probability at most $3/p$.

2. None of $x'_i, y'_i, z'_i$ is reset. Since $x_i + y_i + z_i \in \{p, 2p\}$, to make
$f(x') + f(y') + f(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}) < 0$ we must have $x_i + y_i + z_i = p$ and $z_i \ge p - \sqrt{p}/k$. We must
then have $x_i + y_i \le \sqrt{p}/k$. We know that $\Pr(x_i + y_i \le \sqrt{p}/k) \le \Pr(x_i, y_i \le \sqrt{p}/k) \le 1/p$.

Therefore, negative profit on $(x',y',z')$ occurs with probability at most $4/p$.

Since $f(x) = x_i - p/3 \ge -p/3$, we have $f(x') + f(y') + f(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}) \ge -p$; overall,
we lose at most $(4/p) \cdot p = 4$ on those “bad” $(x',y',z')$.

Overall, for $f(x) = x_i - p/3$, we must have $\mathrm{profit}_{\mathcal{T}}(f) \ge \frac{1}{8}\log p - 4 = \Omega(\log p)$ for
sufficiently large $p$. □
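
The logarithmic growth in the completeness proof comes from the average budget $\mathbb{E}_k \lfloor\sqrt{p}/k\rfloor$ over $k \in [\sqrt{p}]$. A small numerical check of this step is sketched below; it assumes that $\sqrt{p}$ is an integer $m$ and that $\log$ is the natural logarithm, and the function name is hypothetical.

```python
import math


def expected_budget(m):
    """E_k[ floor(m / k) ] for k uniform in {1, ..., m}; m stands in for sqrt(p)."""
    return sum(m // k for k in range(1, m + 1)) / m


for m in (10, 100, 1000, 10000):
    # The average budget stays above log(m) - 1, as used in the proof.
    print(m, expected_budget(m), math.log(m) - 1)
```

The comparison holds for every $m$ because $\lfloor m/k \rfloor \ge m/k - 1$ and the harmonic sum $\sum_{k \le m} 1/k$ is at least $\log m$.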

Now we state the soundness statement. As $f : [p]^n \to \mathbb{R}$ is not bounded, we define its
influence on a transformation of $f$ as follows. We define $\tilde{f}$ to be the integral part of $f$, namely
$\lfloor f \rfloor$. We also define $f' \in [p]$, which is uniquely defined by $f' = \tilde{f} \bmod p$. By abuse of
notation, we also write $f' : [p]^n \to \{0,1\}^p$ with $f'_i$ being the indicator function $\mathbf{1}(\tilde{f} = i \bmod p)$.
The influence of $f'$ is defined with respect to its vector form.

Theorem 8.3.8. (Soundness of $\mathcal{T}$) For $\tau^{C_0/(p \log p)} \le 1/p^4$ and any function $f : [p]^n \to \mathbb{R}$
such that

$$\max_i \mathrm{Inf}_i^{1-\epsilon}[f'] \le \tau,$$

we have that $\mathrm{profit}^{+}_{\mathcal{T}}(f) \le 7$.

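Proof. Notice that the soundness statement is proved for the coupon model, which
automatically gives an upper bound for $\mathrm{profit}_{\mathcal{T}}(f)$.

First let us prove the above statement under the assumption that $f \in [p]$. Then $f'_i = \mathbf{1}(f = i)$.
We also use $\mu_a$ to denote $\mathbb{E}_{x \in [p]^n}[f'_a(x)]$. We can arithmetize and bound the objective
function $\mathrm{profit}^{+}_{\mathcal{T}}(f)$ in terms of $f'_a$ as follows, where every sum ranges over $a,b,c \in [p]$
with $0 \le a+b+c \le \lfloor\sqrt{p}/k\rfloor$:

$$\begin{aligned}
\mathrm{profit}^{+}_{\mathcal{T}}(f)
&= \mathbb{E}_{k}\Bigl[\sum_{a,b,c} \mathbb{E}_{x',y',z'}\bigl[f'_a(x')\, f'_b(y')\, f'_c(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1})\,(a+b+c)\bigr]\Bigr] \\
&= \mathbb{E}_{x,y,z \sim \mathcal{D}_0,\,k}\Bigl[\sum_{a,b,c} T_{1-\epsilon}f'_a(x)\, T_{1-\epsilon}f'_b(y)\, T_{1-\epsilon}f'_c(z \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1})\,(a+b+c)\Bigr] \\
&= \mathbb{E}_{k}\Bigl[\sum_{a,b,c} \mathbb{E}_{x,y,z}\bigl[T_{1-\epsilon}f'_a(x)\, T_{1-\epsilon}f'_b(y)\, T_{1-\epsilon}f'_c(z \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1})\bigr]\,(a+b+c)\Bigr].
\end{aligned}$$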

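Notice that $\mathrm{Inf}_i^{1-\epsilon} f'_a \le \mathrm{Inf}_i^{1-\epsilon} f' \le \tau$ for $i \in [n]$, $a \in [p]$, and that $x,y,z \sim \mathcal{D}_0$ are pairwise
independent. By Theorem 8.2.1 (with minimum probability $\alpha = 1/p$), we can plug in
independent $x,y,z \sim \mathcal{D}_2$ with additive error bounded by $\tau^{C_0/(p \log p)} \le 1/p^4$. That is, with every
sum ranging over $a,b,c \in [p]$ such that $0 \le a+b+c \le \lfloor\sqrt{p}/k\rfloor$,

$$\begin{aligned}
\mathrm{profit}^{+}_{\mathcal{T}}(f)
&\le \mathbb{E}_{k}\Bigl[\sum_{a,b,c} \Bigl(\mathbb{E}_{x,y,z \sim \mathcal{D}_2}\bigl[f'_a(x)\, f'_b(y)\, f'_c(z \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1})\bigr] + 1/p^4\Bigr)(a+b+c)\Bigr] \\
&\le \mathbb{E}_{k \in [\sqrt{p}],\, z}\Bigl[\sum_{a,b,c} \mu_a\, \mu_b\, f'_c(z \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1})\,(a+b+c)\Bigr] + 1.
\end{aligned}$$

The last inequality uses the fact that $a+b+c \le \sqrt{p}$ and that there are at most $p^3$ terms in
the summation.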

An important observation is that since $z$ is independent of $k$, the random vector
$z \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}$ is also independent of the random variable $k$. Also, the distribution
of $z \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}$ is uniform over $[p]^n$.

Therefore, we can further bound $\mathrm{profit}^{+}_{\mathcal{T}}(f)$ by

$$\mathbb{E}_{k}\Bigl[\sum_{0 \le a+b+c \le \lfloor\sqrt{p}/k\rfloor,\; a,b,c \in [p]} \mu_a\, \mu_b\, \mu_c\,(a+b+c)\Bigr] + 1.$$

The term

$$\mathbb{E}_{k \in [\sqrt{p}]}\Bigl[\sum_{0 \le a+b+c \le \lfloor\sqrt{p}/k\rfloor,\; a,b,c \in [p]} \mu_a\, \mu_b\, \mu_c\,(a+b+c)\Bigr]$$

is just $\mathrm{profit}^{+}_{\mathcal{T}'}(f)$. By Proposition 8.3.6, we know that $\mathrm{profit}^{+}_{\mathcal{T}'}(f) \le 1$. Overall, we
bound $\mathrm{profit}^{+}_{\mathcal{T}}(f)$ by 2 when $f \in [p]$.

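The following two observations are useful in our analysis.

Observation 8.3.9. The above proof also works for a randomized function $f(x) \in [p]$
specified by $f'$ in the following way: for each $x$, with probability $f'_i(x)$, $f$ outputs $i$. Here
$\sum_i f'_i(x) = 1$ for any $x \in [p]^n$.

Observation 8.3.10. For any $\theta \in \mathbb{R}$ and $f \in [p]$, we can also bound the profit of the function $f - \theta$.
That is, $\mathrm{profit}^{+}_{\mathcal{T}}(f - \theta) \le 2$.

To see this, simply notice that

$$\mathrm{profit}^{+}_{\mathcal{T}}(f - \theta) = \mathbb{E}_{k}\Bigl[\sum_{3\theta \le a+b+c \le 3\theta + \lfloor\sqrt{p}/k\rfloor,\; a,b,c \in [p]} \mathbb{E}_{x',y',z'}\bigl[f'_a(x')\, f'_b(y')\, f'_c(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1})\,(a+b+c-3\theta)\bigr]\Bigr]$$

and then we use the same proof to show that $\mathrm{profit}^{+}_{\mathcal{T}}(f - \theta) \le \mathrm{profit}^{+}_{\mathcal{T}'}(f - \theta) + 1 \le 2$.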

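Now we need to handle the case in which $f$ is not necessarily a bounded integral function.
Recall that $\tilde{f} = \lfloor f \rfloor$. First, we notice that $f \le \tilde{f} + 1$; therefore, we have that

$$\mathrm{revenue}^{+}\bigl(f(x'), f(y'), f(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}), \lfloor\sqrt{p}/k\rfloor\bigr)
\le \mathrm{revenue}^{+}\bigl(\tilde{f}(x'), \tilde{f}(y'), \tilde{f}(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}), \lfloor\sqrt{p}/k\rfloor\bigr) + 3. \qquad (8.1)$$

We then have that

$$\begin{aligned}
\mathrm{profit}^{+}_{\mathcal{T}}(f) &= \mathbb{E}_{x',y',z',k}\bigl[\mathrm{revenue}^{+}\bigl(f(x'), f(y'), f(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}), \lfloor\sqrt{p}/k\rfloor\bigr)\bigr] \\
&\le \mathbb{E}_{x',y',z',k}\bigl[\mathrm{revenue}^{+}\bigl(\tilde{f}(x'), \tilde{f}(y'), \tilde{f}(z' \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}), \lfloor\sqrt{p}/k\rfloor\bigr) + 3\bigr] \le \mathrm{profit}^{+}_{\mathcal{T}}(\tilde{f}) + 3.
\end{aligned}$$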

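In the next step we show that

$$\mathrm{profit}^{+}_{\mathcal{T}}(\tilde{f}) \le \mathrm{profit}^{+}_{\mathcal{T}}(f') + \mathrm{profit}^{+}_{\mathcal{T}}(f' - p/3) + \mathrm{profit}^{+}_{\mathcal{T}}(f' - 2p/3). \qquad (8.2)$$

By definition of $f'$, we know that

$$\tilde{f}(x) + \tilde{f}(y) + \tilde{f}(z) = f'(x) + f'(y) + f'(z) \bmod p.$$

Therefore, if $0 \le \tilde{f}(x) + \tilde{f}(y) + \tilde{f}(z) \le \lfloor\sqrt{p}/k\rfloor$ for some $k$, it must be the case that

$$f'(x) + f'(y) + f'(z) \in [0, \lfloor\sqrt{p}/k\rfloor] \text{ or } [p, p + \lfloor\sqrt{p}/k\rfloor] \text{ or } [2p, 2p + \lfloor\sqrt{p}/k\rfloor].$$

Therefore,

$$\begin{aligned}
\mathrm{revenue}^{+}\bigl(\tilde{f}(x), \tilde{f}(y), \tilde{f}(z), \lfloor\sqrt{p}/k\rfloor\bigr) \le{}& \mathrm{revenue}^{+}\bigl(f'(x), f'(y), f'(z), \lfloor\sqrt{p}/k\rfloor\bigr) \\
&+ \mathrm{revenue}^{+}\bigl(f'(x) - p/3, f'(y) - p/3, f'(z) - p/3, \lfloor\sqrt{p}/k\rfloor\bigr) \\
&+ \mathrm{revenue}^{+}\bigl(f'(x) - 2p/3, f'(y) - 2p/3, f'(z) - 2p/3, \lfloor\sqrt{p}/k\rfloor\bigr).
\end{aligned}$$

This proves (8.2). And by Observation 8.3.10, we have that

$$\mathrm{profit}^{+}_{\mathcal{T}}(f) \le \mathrm{profit}^{+}_{\mathcal{T}}(\tilde{f}) + 1 \le 3 \cdot 2 + 1 \le 7.$$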

□

8.4 The reduction from UNIQUE-GAMES

In this section we show how to use our Dictator Test $\mathcal{T}$ to obtain our main result,
Theorem 8.1.6. First let us recall the definition of UNIQUE-GAMES.

Definition 8.4.1. For $L \in \mathbb{N}$, a UNIQUE-GAMES$_L$ instance consists of a bipartite graph
having vertex sets $U$, $V$ and edge set $E$, together with a bijective constraint
$\pi^{v,u} : [L] \to [L]$ for each $(u,v) \in E$. In addition, each edge $(u,v) \in E$ has a nonnegative weight $p_{uv}$, with
$\sum_{(u,v) \in E} p_{uv} = 1$. The algorithmic task is to find an assignment $A : (U \cup V) \to [L]$ such that
the total weight of satisfied constraints is as large as possible. Here we say that $A$ satisfies
the constraint $\pi^{v,u}$ if $\pi^{v,u}(A(u)) = A(v)$.
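
To make the objective concrete, here is a minimal Python sketch that computes the total weight of the constraints satisfied by a given assignment. The data layout (edges as `(u, v, pi, weight)` tuples, labels numbered from 0, permutations stored as lists) and the toy instance are assumptions made for illustration only.

```python
def satisfied_weight(edges, assignment):
    """Total weight of edges (u, v) whose constraint pi satisfies pi(A(u)) = A(v).

    edges: list of (u, v, pi, weight), where pi is a list encoding a bijection
    on the labels 0, ..., L-1 (a hypothetical encoding, not from the text).
    """
    total = 0.0
    for u, v, pi, weight in edges:
        if pi[assignment[u]] == assignment[v]:
            total += weight
    return total


# Toy instance with L = 3 labels and edge weights summing to 1.
edges = [
    ("u1", "v1", [1, 2, 0], 0.5),
    ("u1", "v2", [0, 1, 2], 0.25),
    ("u2", "v1", [2, 0, 1], 0.25),
]
all_satisfied = {"u1": 0, "u2": 2, "v1": 1, "v2": 0}
one_violated = {"u1": 0, "u2": 0, "v1": 1, "v2": 0}

print(satisfied_weight(edges, all_satisfied))
print(satisfied_weight(edges, one_violated))
```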

The following is an equivalent version of the UGC due to Khot and Regev [103, Lemma 3.6]:

Theorem 8.4.2. Assume the UGC. For all small $\zeta, \gamma > 0$, there exists $L \in \mathbb{N}$ such that, given an
unweighted UNIQUE-GAMES$_L$ instance $\mathcal{G} = (U, V, E, \{\pi^{v,u}\}_{(u,v) \in E})$ which is $U$-regular, it is
NP-hard to distinguish the following two cases:

1. There is an assignment $A : (U \cup V) \to [L]$ and a subset $U' \subseteq U$ with $|U'|/|U| \ge 1 - \zeta$ such
that $A$ satisfies all constraints incident on $U'$.

2. There is no assignment $A$ that satisfies more than a $\gamma$ fraction of the constraints.

We make the following reduction from a UNIQUE-GAMES instance $\mathcal{G}$ to a VERTEX-PRICING$_3$
instance $\mathcal{I}$. The reduction is very similar to the one in [99]. Given the UNIQUE-GAMES$_L$
instance $\mathcal{G} = (U, V, E, \{\pi^{v,u}\})$, the reduction produces a weighted VERTEX-PRICING instance
with variable set $V \times [p]^L$. We think of a price assignment $F$ to these variables as a
collection of functions $F = \{f_v : [p]^L \to \mathbb{R}\}$, one for each $v \in V$. We now define the instance
according to the following procedure.

Reduction from UNIQUE-GAMES

1. Choose $u \in U$ randomly.

2. Choose 3 of $u$’s neighbors $v_1, v_2, v_3$ randomly (with replacement).

3. Generate $(x,y,z) \sim \mathcal{D}_2$ and $k$ randomly from $[\sqrt{p}]$.

4. Add a constraint among $f_{v_1}(\pi^{v_1,u}(x))$, $f_{v_2}(\pi^{v_2,u}(y))$, $f_{v_3}(\pi^{v_3,u}(z) \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1})$ with
budget $\lfloor\sqrt{p}/k\rfloor$.

Here, for $x \in [p]^L$ and a mapping $\pi : [L] \to [L]$, we denote by $\pi(x) \in [p]^L$ the permutation
of $x$’s coordinates according to $\pi$; i.e., $\pi(x)_i = x_{\pi(i)}$.
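
The coordinate permutation can be sketched in a few lines of Python. The example also shows how a dictator price function composed with $\pi$ becomes a dictator on the permuted coordinate, which is the mechanism the completeness argument relies on. Labels are numbered from 0 here, and the helper names `permute` and `dictator` are hypothetical.

```python
def permute(pi, x):
    """Return pi(x), where pi(x)_i = x_{pi(i)}; pi is a list encoding a map on 0..L-1."""
    return tuple(x[pi[i]] for i in range(len(x)))


def dictator(label, p):
    """Price function f(x) = x_label - p/3 on vectors x in [p]^L."""
    return lambda x: x[label] - p / 3


pi = [2, 0, 1]
x = (5, 7, 9)
print(permute(pi, x))

# Composing the dictator on coordinate j with pi gives the dictator on coordinate pi(j).
for j in range(3):
    print(dictator(j, 12)(permute(pi, x)), dictator(pi[j], 12)(x))
```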

We claim that the above reduction has the following property.

Theorem 8.4.3. For $\zeta = 1/p$, $\tau$ satisfying $\tau^{C_0/(p \log p)} \le 1/p^4$, and $\gamma = \tau^2/p^4$, the above reduction
has the following properties:

• (Completeness.) If statement 1 in Theorem 8.4.2 holds for $\mathcal{G}$, then there is a price
assignment $F$ such that $\mathrm{profit}_{\mathcal{I}}(F) = \Omega(\log p)$. In addition, the price assigned to
each variable is $p$-bounded, i.e., has value $\ge -p$.

• (Soundness.) If there is no assignment for $\mathcal{G}$ that satisfies more than a $\gamma$ fraction of
the edges, then every price assignment $F$ satisfies $\mathrm{profit}^{+}_{\mathcal{I}}(F) \le 25$.

By combining Theorem 8.4.3 with Theorem 8.4.2, and setting $p = B$, we prove Theorem 8.1.6.

8.4.1 Proof of Theorem 8.4.3

It remains to prove Theorem 8.4.3.

Proof. (Completeness) To prove the completeness part of Theorem 8.4.3, suppose that the
assignment $A : (U \cup V) \to [L]$ and the subset $U' \subseteq U$ are as in statement 1 of Theorem 8.4.2. Define a
price assignment $F$ for $\mathcal{I}$ by taking $f_v(x) = x_{A(v)} - p/3$. Then by definition and the property
of $A$, for $u' \in U'$, $f_{v_i}(\pi^{v_i,u'}(x)) = x_{A(u')} - p/3$ for $i = 1,2,3$. Thus by the completeness of the
Dictator Test (Theorem 8.3.7), the assignment $F$ has profit at least $\Omega(\log p)$ conditioned
on $u \in U'$ being picked. In the case that some $u \notin U'$ is picked, we incur a loss of at most
$p$. Overall, we have that $\mathrm{profit}_{\mathcal{I}}(F) \ge (1 - \zeta)\,\Omega(\log p) - \zeta p$. Notice that we choose
$\zeta = 1/p$; therefore, $\mathrm{profit}_{\mathcal{I}}(F) \ge \Omega(\log p)$. In addition, we know that the value of
each $f_v$ is at least $-p/3$.

(Soundness) We prove the soundness statement by contradiction. Suppose that some
assignment $F$ has $\mathrm{profit}^{+}_{\mathcal{I}}(F) > 25$; we will exhibit an assignment to the Unique Games
instance $\mathcal{G}$ that satisfies a $\gamma$ fraction of the edges. Notice that the maximum profit on each
constraint is at most $\sqrt{p}$; then by an averaging argument, for at least a $1/\sqrt{p}$ fraction
of the vertices $u \in U$ picked in the first step, the expected profit on these $u$ is above 24.

Let us call these $u$ “good”. Write $N(u)$ for the set of neighbors of $u$. By definition, for a fixed
“good” $u$, we know that

$$\mathbb{E}_{v_1,v_2,v_3 \in N(u),\, x,y,z \sim \mathcal{D}_2,\, k}\bigl[\mathrm{revenue}^{+}\bigl(f_{v_1}(\pi^{v_1,u}(x)), f_{v_2}(\pi^{v_2,u}(y)), f_{v_3}(\pi^{v_3,u}(z) \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}), \lfloor\sqrt{p}/k\rfloor\bigr)\bigr] > 24.$$

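Similar to the analysis of Theorem 8.3.8, we define $\tilde{f}_v = \lfloor f_v \rfloor$ and introduce $f'_v \in [p]$
such that $f'_v = \tilde{f}_v \bmod p$, although we also write $f'_v$ as a function $[p]^L \to \{0,1\}^p$ whose $i$-th coordinate
indicates whether $f'_v$ is $i$. We call the assignment corresponding to $\{\tilde{f}_v\}_{v \in V}$ $\tilde{F}$ and the
assignment corresponding to $\{f'_v\}_{v \in V}$ $F'$. Here and below, profits are taken over the constraints
generated from the fixed good vertex $u$.

By the proof of (8.1), we know that

$$\mathrm{profit}^{+}_{\mathcal{I}}(\tilde{F}) \ge \mathrm{profit}^{+}_{\mathcal{I}}(F) - 3 \ge 21$$

and by the proof of (8.2), we have that

$$\mathrm{profit}^{+}_{\mathcal{I}}(F') + \mathrm{profit}^{+}_{\mathcal{I}}(F' - p/3) + \mathrm{profit}^{+}_{\mathcal{I}}(F' - 2p/3) \ge \mathrm{profit}^{+}_{\mathcal{I}}(\tilde{F}) \ge 21.$$

Therefore, one of $\mathrm{profit}^{+}_{\mathcal{I}}(F')$, $\mathrm{profit}^{+}_{\mathcal{I}}(F' - p/3)$, $\mathrm{profit}^{+}_{\mathcal{I}}(F' - 2p/3)$ must be at least 7.
Assume that $\mathrm{profit}^{+}_{\mathcal{I}}(F' - p/3) \ge 7$. (The other 2 cases are similar.)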

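We know then, writing $f'_{v,a}$ for the $a$-th coordinate of $f'_v$ and letting every sum range over
$a,b,c \in [p]$ with $p \le a+b+c \le p + \lfloor\sqrt{p}/k\rfloor$,

$$\begin{aligned}
\mathrm{profit}^{+}_{\mathcal{I}}(F' - p/3)
&= \mathbb{E}_{x,y,z,k,v_1,v_2,v_3}\Bigl[\sum_{a,b,c} f'_{v_1,a}(\pi^{v_1,u}(x))\, f'_{v_2,b}(\pi^{v_2,u}(y))\, f'_{v_3,c}(\pi^{v_3,u}(z) \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1})\,(a+b+c-p)\Bigr] \\
&= \mathbb{E}_{x,y,z,k}\Bigl[\sum_{a,b,c} \mathbb{E}_{v_1 \in N(u)}\bigl[f'_{v_1,a}(\pi^{v_1,u}(x))\bigr] \cdot \mathbb{E}_{v_2 \in N(u)}\bigl[f'_{v_2,b}(\pi^{v_2,u}(y))\bigr] \\
&\qquad\qquad \cdot\, \mathbb{E}_{v_3 \in N(u)}\bigl[f'_{v_3,c}(\pi^{v_3,u}(z \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1}))\bigr]\,(a+b+c-p)\Bigr]. \qquad (8.3)
\end{aligned}$$

If we define $f'_{u,i}(x) = \mathbb{E}_{v \in N(u)}[f'_{v,i}(\pi^{v,u}(x))]$ for $i \in [p]$, then we have that

$$\mathrm{profit}^{+}_{\mathcal{I}}(F' - p/3) = \mathbb{E}_{x,y,z,k}\Bigl[\sum_{a,b,c} f'_{u,a}(x)\, f'_{u,b}(y)\, f'_{u,c}(z \oplus_p \lfloor\sqrt{p}/k\rfloor \cdot \mathbf{1})\,(a+b+c-p)\Bigr] \ge 7. \qquad (8.4)$$

Denote $f'_u(x) = (f'_{u,1}(x), \ldots, f'_{u,p}(x))$. It is easy to check that $\sum_i f'_{u,i}(x) = 1$. Then $f'_u$ can be
viewed as a randomized function that, on a particular $x$, outputs $i$ with probability $f'_{u,i}(x)$.
Then (8.4) is equal to the profit of the Dictator Test $\mathcal{T}$ on $f'_u - p/3$, that is, $\mathrm{profit}^{+}_{\mathcal{T}}(f'_u - p/3) \ge 7$.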

We know then, by the contrapositive of Theorem 8.3.8 along with Observation
8.3.10 and Observation 8.3.9, that there must be some $i$ such that $\mathrm{Inf}_i^{1-\epsilon} f'_u \ge \tau$.

Then by Fact 7.3.3, we know that

$$\tau \le \mathrm{Inf}_i^{1-\epsilon} f'_u \le \mathbb{E}_{v \in N(u)}\bigl[\mathrm{Inf}_i^{1-\epsilon} f'_v(\pi^{v,u}(x))\bigr].$$

By an averaging argument, since $\mathrm{Inf}_i^{1-\epsilon} f'_v(\pi^{v,u}(x)) = \sum_{j \in [p]} \mathrm{Inf}_i^{1-\epsilon} f'_{v,j}(\pi^{v,u}(x)) \le p$, for at least a
$\tau/2p$ fraction of the $v \in N(u)$ we have that $\mathrm{Inf}_i^{1-\epsilon} f'_v(\pi^{v,u}(x)) = \mathrm{Inf}^{1-\epsilon}_{(\pi^{v,u})^{-1}(i)} f'_v \ge \tau/2$.

Now consider choosing a random assignment. Let $S_u = \{i \mid \mathrm{Inf}_i^{1-\epsilon} f'_u \ge \tau\}$ and
$S_v = \{i \mid \mathrm{Inf}_i^{1-\epsilon} f'_v \ge \tau/2\}$. By Fact 7.3.4, we know that $|S_v| \le p^2/\tau$.

The assignment randomly picks a label in $S_u$ for each $u$ and a label in $S_v$ for each $v$.
Then it is easy to see that for a good vertex $u$ and any of its coordinates $i \in S_u$, a $\tau/2p$ fraction
of its neighbors $v$ have a matching coordinate $j = (\pi^{v,u})^{-1}(i)$ in $S_v$. Therefore the above
assignment satisfies at least a $(1/|S_v|) \cdot \tau/2p$ fraction of the edges of each “good” $u$. We know that
at least a $1/\sqrt{p}$ fraction of the $u$ are good. By the regularity of the graph on the $U$
side, we know that such a labeling strategy satisfies at least a $(1/\sqrt{p}) \cdot (\tau/p^2) \cdot \tau/2p \ge \tau^2/p^4 = \gamma$
fraction of the edges.

□

Part III

Hardness of Learning

Chapter 9

Hardness of Learning Monomials

9.1 Introduction


Monomials  (conjunctions),  decision  lists,  and  halfspaces  are  among  the  most  basic  concept 
classes  in  learning  theory.  They  are  all  long-known  to  be  efficiently  PAC  learnable,  when 
the  given  examples  are  guaranteed  to  be  consistent  with  a  function  from  any  of  these 
concept  classes.  However,  in  practice  data  is  often  noisy  or  too  complex  to  be  consistently 
explained  by  a  simple  concept.  Dealing  with  noise  and  other  inconsistencies  is  thus  one  of 
the  most  significant  issues  in  learning  theory.  In  this  chapter,  we  prove  a  strong  hardness 
result  for  agnostic  learning  of  monomials  using  halfspaces,  or  equivalently  the  MON-HS- 
MA  problem. 

Theorem 9.1.1. The problem MON-HS-MA$(1 - \epsilon, 1/2 + \epsilon)$ is NP-hard.

Note that this hardness result is essentially optimal, since it is trivial to find a hypothesis
with agreement rate 1/2: output either the function that is always 0 or the function
that is always 1.

Since  the  class  of  monomials  is  a  subset  of  the  class  of  decision  lists  which  in  turn 
is  a  subset  of  the  class  of  halfspaces,  our  result  implies  an  optimal  hardness  result  for 
proper  agnostic  learning  of  decision  lists.  In  addition,  a  similar  hardness  result  for  proper 
agnostic  learning  of  majority  functions  can  be  obtained  via  a  simple  reduction. 


9.2  Proof  Overview 

Following the usual recipe for hardness reductions, the first step is to construct a dictatorship test such that a dictator monomial passes with probability $1-\epsilon$, while a function that is far from every dictator passes with probability at most $1/2+\epsilon$.

We prove Theorem 9.1.1 by exhibiting a reduction from the $k$-Label-Cover problem, which is a particular variant of the Label-Cover problem. The $k$-Label-Cover problem is defined as follows:

Definition 9.2.1. For $k \ge 2$, an instance of $k$-Label-Cover $\mathcal{L}(G(V,E), M, N, \{\pi^{v,e} \mid e \in E, v \in e\})$ consists of a $k$-uniform connected (multi-)hypergraph $G(V,E)$ with vertex set $V$ and edge set $E$; a set of functions $\{\pi^{v_i,e}\}_{i=1}^{k}$; and a set of labels $[M] = \{1,2,\ldots,M\}$ for some positive integer $M$. Every hyperedge $e = (v_1,\ldots,v_k)$ is associated with a $k$-tuple of projection functions $\{\pi^{v_i,e}\}_{i=1}^{k}$ where $\pi^{v_i,e} : [M] \to [N]$.

A vertex labeling $\mathcal{A}$ is an assignment of labels to vertices $\mathcal{A} : V \to [M]$. A labeling $\mathcal{A}$ is said to strongly satisfy an edge $e$ if $\pi^{v_i,e}(\mathcal{A}(v_i)) = \pi^{v_j,e}(\mathcal{A}(v_j))$ for every $v_i, v_j \in e$. A labeling $\mathcal{A}$ weakly satisfies an edge $e$ if $\pi^{v_i,e}(\mathcal{A}(v_i)) = \pi^{v_j,e}(\mathcal{A}(v_j))$ for some $v_i, v_j \in e$ with $v_i \ne v_j$.

The  goal  in  Label-Cover  is  to  find  a  vertex  labeling  that  satisfies  as  many  edges 
(projection  constraints)  as  possible. 
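The two notions of satisfaction can be made concrete with a small sketch. The function name `edge_status`, the dictionary-based representation of the projections, and the toy edge below are illustrative assumptions, not part of the original text.

```python
from itertools import combinations


def edge_status(edge, projections, labeling):
    """Return (strongly_satisfied, weakly_satisfied) for one hyperedge.

    edge:        tuple of distinct vertices (v_1, ..., v_k)
    projections: dict vertex -> dict label -> projected label (the maps pi^{v,e})
    labeling:    dict vertex -> label
    """
    # Project the label of every vertex of the edge.
    projected = [projections[v][labeling[v]] for v in edge]
    pairs = list(combinations(range(len(edge)), 2))
    # Strong: all pairs agree after projection; weak: at least one pair agrees.
    strong = all(projected[i] == projected[j] for i, j in pairs)
    weak = any(projected[i] == projected[j] for i, j in pairs)
    return strong, weak


# Toy 3-uniform edge with labels {1, 2} (hypothetical data).
edge = ("u", "v", "w")
projections = {
    "u": {1: 1, 2: 1},
    "v": {1: 1, 2: 2},
    "w": {1: 1, 2: 2},
}
```

With this toy edge, labeling every vertex 1 strongly satisfies the edge, while labeling every vertex 2 satisfies it only weakly (the projections of `v` and `w` agree, that of `u` does not).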

For the sake of clarity, we first present the proof of Theorem 9.1.1 assuming the Unique Games Conjecture. Consequently, we will be interested in the $k$-Unique Label-Cover problem, which is the special case of $k$-Label-Cover where $M = N$ and all the projection functions $\{\pi^{v,e} \mid v \in e, e \in E\}$ are bijections. The following inapproximability result for $k$-Unique Label-Cover is equivalent to the Unique Games Conjecture of Khot [98].
Conjecture 9.2.2. For every constant $\eta > 0$ and every positive integer $k$, there exists $R_0$ such that for all positive integers $R > R_0$, given an instance $\mathcal{L}(G(V,E), R, R, \{\pi^{v,e} \mid e \in E, v \in e\})$ it is NP-hard to distinguish between

• strongly satisfiable instances: there exists a labeling $\mathcal{A} : V \to [R]$ that strongly satisfies a $1 - k\eta$ fraction of the edges $E$.

•  almost  unsatisfiable  instances:  there  is  no  labeling  that  weakly  satisfies  fraction 
of  the  edges. 

A proof of the equivalence between the above conjecture and the Unique Games Conjecture can be found in [103].

Given an instance $\mathcal{L}$ of $k$-Unique Label-Cover, we will produce a distribution $\mathcal{D}$ over labeled examples such that the following holds: if $\mathcal{L}$ is a strongly satisfiable instance, then there is a disjunction (an OR function) that agrees with a randomly chosen example with probability at least $1-\epsilon$, while if $\mathcal{L}$ is an almost unsatisfiable instance then no halfspace agrees with a random example from $\mathcal{D}$ with probability more than $\frac{1}{2}+\epsilon$. Clearly, such a reduction implies Theorem 9.1.1 assuming the Unique Games Conjecture, but with disjunctions in place of conjunctions. De Morgan's law and the fact that the negation of a halfspace is a halfspace then imply that the statement is also true for monomials (we use disjunctions only for convenience).

Let $\mathcal{L}$ be an instance of $k$-Unique Label-Cover on hypergraph $G = (V,E)$ and a set of labels $[R]$. The examples we generate will have $|V| \times R$ coordinates, i.e., belong to $\{0,1\}^{|V| \times R}$. These coordinates are to be thought of as one block of $R$ coordinates for every vertex $v \in V$. We will index the coordinates of $x \in \{0,1\}^{|V| \times R}$ as $x = (x_v^{(\ell)})_{v \in V, \ell \in [R]}$.

For every labeling $\mathcal{A} : V \to [R]$ of the instance, there is a corresponding disjunction (OR function) over $\{0,1\}^{|V| \times R}$ given by

$$h(x) = \bigvee_{v \in V} x_v^{(\mathcal{A}(v))}.$$

Thus, using a label $\ell$ for a vertex $v$ is encoded as including the literal $x_v^{(\ell)}$ in the disjunction. Notice that an arbitrary halfspace over $\{0,1\}^{|V| \times R}$ need not correspond to any labeling at all. The idea is to construct a distribution on examples which ensures that any halfspace agreeing with at least a $\frac{1}{2}+\epsilon$ fraction of random examples somehow corresponds to a labeling of $\mathcal{L}$ weakly satisfying a constant fraction of the edges in $\mathcal{L}$.
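The encoding of a labeling as a disjunction, and the fact that this disjunction is itself a halfspace, can be sketched as follows. The function names, the 0-indexed labels, and the choice of unit weights with threshold 1/2 are illustrative assumptions.

```python
from itertools import product


def disjunction_from_labeling(labeling):
    """h(x) = OR over vertices v of x[v][labeling[v]] (labels are 0-indexed here)."""
    def h(x):
        return int(any(x[v][label] for v, label in labeling.items()))
    return h


def halfspace_from_labeling(labeling, num_labels):
    """The same disjunction written as a halfspace: weight 1 on each chosen
    literal, weight 0 elsewhere, threshold 1/2."""
    weights = {v: [int(l == label) for l in range(num_labels)]
               for v, label in labeling.items()}

    def h(x):
        total = sum(w * b for v in weights for w, b in zip(weights[v], x[v]))
        return int(total > 0.5)
    return h


def agree_everywhere(labeling, num_labels):
    """Brute-force check that the two representations coincide on every input."""
    vertices = sorted(labeling)
    f = disjunction_from_labeling(labeling)
    g = halfspace_from_labeling(labeling, num_labels)
    for bits in product((0, 1), repeat=len(vertices) * num_labels):
        # Split the flat bit string into one block of num_labels bits per vertex.
        x = {v: bits[i * num_labels:(i + 1) * num_labels]
             for i, v in enumerate(vertices)}
        if f(x) != g(x):
            return False
    return True
```

The converse fails: most weight vectors do not arise from any labeling, which is why the reduction needs a decoding step from halfspaces back to labelings.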

Fix an edge $e = (v_1,\ldots,v_k)$. For the sake of exposition, let us assume $\pi^{v_i,e}$ is the identity permutation for every $i \in [k]$. The general case is not any more complicated. For the edge $e$, we require a distribution on examples $\mathcal{D}_e$ with the following properties:

• All coordinates $x_v^{(\ell)}$ for a vertex $v \notin e$ are fixed to be zero. Restricted to these examples, the halfspace $h$ can be written as $h(x) = \mathrm{sgn}\big(\sum_{i \in [k]} \langle w_{v_i}, x_{v_i} \rangle - \theta\big)$.

• For any label $\ell \in [R]$, the labeling $\mathcal{A}(v_1) = \ldots = \mathcal{A}(v_k) = \ell$ strongly satisfies the edge $e$. Hence, the corresponding disjunction $\bigvee_{i \in [k]} x_{v_i}^{(\ell)}$ needs to have agreement $\ge 1-\epsilon$ with the examples from $\mathcal{D}_e$.

• There exists a decoding procedure that, given a halfspace $h$, outputs a labeling $L_h$ for $\mathcal{L}$ such that, if $h$ has agreement $\ge \frac{1}{2}+\epsilon$ with the examples from $\mathcal{D}_e$, then $L_h$ weakly satisfies the edge $e$ with non-negligible probability.

For conceptual clarity, let us rephrase the above requirement as a testing problem. Given a halfspace $h$, consider a randomized procedure that samples an example $(x,b)$ from the distribution $\mathcal{D}_e$, and accepts if $h(x) = b$. This amounts to a test that checks if the function $h$ corresponds to a consistent labeling. Further, let us suppose the halfspace $h$ is given by $h(x) = \mathrm{sgn}\big(\sum_{v \in V} \langle w_v, x_v \rangle - \theta\big)$. Define the linear function $l_v : \{0,1\}^R \to \mathbb{R}$ as $l_v(x_v) = \langle w_v, x_v \rangle$. Then, we have $h(x) = \mathrm{sgn}\big(\sum_{v \in V} l_v(x_v) - \theta\big)$.

For a halfspace $h$ corresponding to a labeling $L$, we will have $l_v(x_v) = x_v^{(L(v))}$, a dictator function. Formally, the $\ell$'th dictator function on $\{0,1\}^R$ is given by $F(x) = x^{(\ell)}$. Thus, in the intended solution every linear function $l_v$ associated with the halfspace $h$ is a dictator function.

Now, let us again restate the above testing problem in terms of these linear functions. For succinctness, we write $l_i$ for the linear function $l_{v_i}$. We need a randomized procedure that does the following:

Given $k$ linear functions $l_1,\ldots,l_k : \{0,1\}^R \to \mathbb{R}$, queries the functions at one point each (say $x_1,\ldots,x_k$ respectively), and accepts if $\mathrm{sgn}\big(\sum_{i=1}^{k} l_i(x_i) - \theta\big) = b$.

The procedure must satisfy,

• (Completeness) If each of the linear functions $l_i$ is the $\ell$'th dictator function for some $\ell \in [R]$, then the test accepts with probability $1-\epsilon$.

• (Soundness) If the test accepts with probability $\frac{1}{2}+\epsilon$, then at least two of the linear functions are close to the same dictator function.

A testing problem of the above nature is referred to as Dictatorship Testing and is a recurring theme in hardness of approximation.

Notice that the notion of a linear function being close to a dictator function is not formally defined yet. In most applications, a function is said to be close to a dictator if it has influential coordinates. It is easy to see that this notion is not sufficient by itself here. For example, in the halfspace $\mathrm{sgn}(10^{100} x_1 + x_2 - 0.5)$, although the coordinate $x_2$ has little influence on the linear function, it has significant influence on the halfspace.
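This example can be checked by brute force. In the sketch below, the "influence on the linear function" is proxied by the share of squared weight carried by $x_2$, and the influence on the halfspace is the fraction of inputs in $\{0,1\}^2$ where flipping the bit changes the output; both function names and the choice of proxy are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def halfspace(w, theta):
    """Return the 0/1-valued halfspace x -> [<w, x> - theta > 0]."""
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta > 0 else 0


def influence(f, n, i):
    """Fraction of points x in {0,1}^n where flipping bit i changes f(x)."""
    changed = 0
    for x in product((0, 1), repeat=n):
        flipped = list(x)
        flipped[i] = 1 - flipped[i]
        if f(x) != f(tuple(flipped)):
            changed += 1
    return Fraction(changed, 2 ** n)


w = (10 ** 100, 1)
theta = Fraction(1, 2)
h = halfspace(w, theta)

# Share of the squared weight carried by x_2 (proxy for its influence on the
# linear function): astronomically small.
weight_share = Fraction(w[1] ** 2, w[0] ** 2 + w[1] ** 2)
```

Here the halfspace computes the OR of its two inputs, so flipping $x_2$ changes the output whenever $x_1 = 0$, even though its weight is negligible.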

We resolve this problem by using the notion of critical index (Definition 9.3.1), which was introduced in [133] and has found numerous applications in the analysis of halfspaces [41, 113, 119]. Roughly speaking, given a linear function $l$, the idea is to recursively delete its influential coordinates until there are none left. The total number of coordinates so deleted is referred to as the critical index of $l$. Let $c_\tau(w_i)$ denote the critical index of $w_i$, and let $C_\tau(w_i)$ denote the set of the $c_\tau(w_i)$ largest coordinates of $w_i$. The linear function $l_i$ is said to be close to the $j$'th dictator function for every $j$ in $C_\tau(w_i)$. A function is far from every dictator if it has critical index 0.

An important issue is that the critical index of a linear function can be much larger than the number of influential coordinates and cannot be appropriately bounded. In other words, a linear function can be close to a large number of dictator functions, as per the definition above. To counter this, we employ a structural lemma about halfspaces that was used in the recent work on fooling halfspaces with limited independence [41]. Using this lemma, we are able to prove that if the critical index is large, then one can in fact zero out the coordinates of $w_i$ outside the $t$ largest coordinates for some large enough $t$, and the agreement of the halfspace $h$ changes only by a negligible amount. Thus, we first carry out the zeroing operation for all linear functions with large critical index.

We now describe the above construction and analysis of the dictatorship test in some more detail. It is convenient to think of the $k$ queries $x_1,\ldots,x_k$ as the rows of a $k \times R$ matrix with $\{0,1\}$ entries. Henceforth, we will refer to matrices in $\{0,1\}^{k \times R}$ and their rows and columns.

We construct two distributions $\mathcal{D}_0, \mathcal{D}_1$ on $\{0,1\}^k$ such that for $s = 0, 1$, we have $\Pr_{x \in \mathcal{D}_s}\big[\bigvee_{i=1}^{k} x_i = s\big] \ge 1-\epsilon$ for $\epsilon = o_k(1)$ (this will ensure the completeness of the reduction, i.e., certain disjunctions pass with high probability). Further, the distributions will be carefully chosen to have matching first four moments. This will be used in the soundness analysis, where we will use an invariance principle to infer structural properties of halfspaces that pass the test with probability noticeably greater than 1/2.

We define the distribution $\tilde{\mathcal{D}}_s^R$ on matrices $\{0,1\}^{k \times R}$ by sampling $R$ columns independently according to $\mathcal{D}_s$, and then perturbing each bit with a small random noise. We define the following test (or equivalently, distribution on examples): given a halfspace $h$ on $\{0,1\}^{k \times R}$, with probability 1/2 we check $h(x) = 0$ for a sample $x \in \tilde{\mathcal{D}}_0^R$, and with probability 1/2 we check $h(x) = 1$ for a sample $x \in \tilde{\mathcal{D}}_1^R$.

Completeness. By construction, each of the $R$ disjunctions $\mathrm{OR}_j(x) = \bigvee_{i=1}^{k} x_i^{(j)}$ passes the test with probability at least $1-\epsilon$ (here $x_i^{(j)}$ denotes the entry in the $i$'th row and $j$'th column of $x$).

Soundness. For the soundness analysis, suppose $h(x) = \mathrm{sgn}(\langle w, x \rangle - \theta)$ is a halfspace that passes the test with probability at least $1/2+\epsilon$. The halfspace $h$ can be written in two ways by expanding the inner product $\langle w, x \rangle$ along rows and columns, i.e., $h(x) = \mathrm{sgn}\big(\sum_{i=1}^{k} \langle w_i, x_i \rangle - \theta\big) = \mathrm{sgn}\big(\sum_{j=1}^{R} \langle w^{(j)}, x^{(j)} \rangle - \theta\big)$. Let us denote $l_i(x) = \langle w_i, x_i \rangle$.

First, let us see why the linear functions $\langle w_i, x_i \rangle$ must be close to some dictator. Note that we eventually need to show that two of the linear functions are close to the same dictator.

Suppose each of the linear functions $l_i$ is not close to any dictator. In other words, for each $i$, no single coordinate of the vector $w_i$ is too large (contains more than a $\tau$-fraction of the $\ell_2$ mass $\|w_i\|_2$ of the vector $w_i$). Clearly, this implies that no single column of the matrix $w$ is too large.

Recall that the halfspace is given by $h(x) = \mathrm{sgn}\big(\sum_{j \in [R]} \langle w^{(j)}, x^{(j)} \rangle - \theta\big)$. Here $l(x) = \sum_{j \in [R]} \langle w^{(j)}, x^{(j)} \rangle - \theta$ is a degree 1 polynomial into which we are substituting values from two product distributions $\tilde{\mathcal{D}}_0^R$ and $\tilde{\mathcal{D}}_1^R$. Further, the distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ have matching moments up to order 4 by design. Using the invariance principle, the distribution of $l(x)$ is roughly the same whether $x$ is from $\tilde{\mathcal{D}}_0^R$ or $\tilde{\mathcal{D}}_1^R$. Thus, the halfspace $h$ is unable to distinguish between the distributions $\tilde{\mathcal{D}}_0^R$ and $\tilde{\mathcal{D}}_1^R$ with a noticeable advantage.

Now suppose some linear functions are close to dictators, but no two linear functions $l_i$ are close to the same dictator, i.e., $C_\tau(w_i) \cap C_\tau(w_j) = \emptyset$ for all $i \ne j$. In this case, we condition on the values of $x_i^{(j)}$ for $j \in C_\tau(w_i)$ (note that we condition on at most one value in each column, so the conditional distribution on each column still has matching first three moments), and then apply the invariance principle using the fact that after deleting the coordinates in $C_\tau(w_i)$, all the remaining coefficients of the weight vector $w$ are small (by definition of critical index). Again the halfspace cannot distinguish the two distributions, so a halfspace that passes the test with a noticeable advantage must have $C_\tau(w_i) \cap C_\tau(w_j) \ne \emptyset$ for some two rows $i, j$. This finishes the proof of the soundness claim.

The  above  consistency-enforcing  test  almost  immediately  yields  the  Unique  Games 
hardness  of  weak  learning  disjunctions  by  halfspaces.  To  prove  NP-hardness,  we  reduce 
a  version  of  Label  Cover  to  our  problem.  This  requires  a  more  complicated  consistency 
check,  and  we  have  to  overcome  several  additional  technical  obstacles  in  the  proof. 

The main obstacle encountered in transferring the dictatorship test to a Label Cover-based hardness is one that commonly arises for several other problems. Specifically, the projection constraint on an edge $e = (u,v)$ maps a large set of labels $L = \{\ell_1,\ldots,\ell_d\}$ corresponding to a vertex $u$ to a single label $\ell$ for the vertex $v$. While composing the Label Cover constraint $(u,v)$ with the dictatorship test, all labels in $L$ have to be necessarily equivalent. In several settings including this work, this requires the coordinates corresponding to labels in $L$ to be mostly identical. However, on making the coordinates corresponding to $L$ identical, the prover corresponding to $u$ can determine the identity of the edge $(u,v)$, thus completely destroying the soundness of the composition. In fact, the natural extension of the Unique Games-based reduction for MaxCut [100] to a corresponding Label Cover hardness fails primarily for this reason.

Unlike MaxCut or other Unique Games-based reductions, in our case the soundness of the dictatorship test is required to hold only against a specific class of functions, i.e., halfspaces. Harnessing this fact, we execute the reduction starting from a Label Cover instance whose projections are unique on average. More precisely, a smooth Label Cover (introduced in [95]) is one in which for every vertex $u$ and every pair of labels $\ell, \ell'$, the labels $\{\ell, \ell'\}$ project to the same label with a tiny probability over the choice of the edge $e = (u,v)$. Technically, we express the error term in the invariance principle as a certain fourth moment of the halfspace, and use the smoothness to bound this error term for most edges of the Label Cover instance. It is of great interest to find other applications where a weak uniqueness property like the smoothness condition can be used to convert a Unique Games hardness result to an unconditional NP-hardness result.


9.3  Preliminaries 

In  this  section,  we  define  two  important  tools  in  our  analysis:  i)  critical  index,  ii)  invari¬ 
ance  principle. 

9.3.1  Critical  Index 

The  notion  of  critical  index  was  first  introduced  by  Servedio  [133]  and  plays  an  important 
role  in  the  analysis  of  halfspaces  in  [41,  113,  119]. 

Definition 9.3.1. Given any real vector $w = (w^{(1)},\ldots,w^{(n)}) \in \mathbb{R}^n$, reorder the coordinates by decreasing absolute value, i.e., $|w^{(i_1)}| \ge |w^{(i_2)}| \ge \ldots \ge |w^{(i_n)}|$, and denote $\sigma_t^2 = \sum_{j=t}^{n} |w^{(i_j)}|^2$. For $0 \le \tau \le 1$, the $\tau$-critical index of the vector $w$ is defined to be the smallest index $k$ such that $|w^{(i_k)}| \le \tau\sigma_k$. If no such $k$ exists ($\forall k$, $|w^{(i_k)}| > \tau\sigma_k$), the $\tau$-critical index is defined to be $+\infty$. The vector $w$ is said to be $\tau$-regular if the $\tau$-critical index is 1.
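The definition translates directly into a short routine. The following is a minimal sketch that uses floating-point arithmetic; the function name `critical_index` and the use of `math.inf` for the value $+\infty$ are illustrative choices.

```python
import math


def critical_index(w, tau):
    """tau-critical index of w as in Definition 9.3.1 (1-indexed), or math.inf."""
    # Coordinates sorted by decreasing absolute value.
    mags = sorted((abs(c) for c in w), reverse=True)
    # tail_sq[t] = sum of squares of mags[t:], i.e. sigma_{t+1}^2 in 1-indexed terms.
    tail_sq = [0.0] * (len(mags) + 1)
    for t in range(len(mags) - 1, -1, -1):
        tail_sq[t] = tail_sq[t + 1] + mags[t] ** 2
    # Smallest index whose coordinate is at most tau times the tail norm.
    for t, m in enumerate(mags):
        if m <= tau * math.sqrt(tail_sq[t]):
            return t + 1
    return math.inf
```

For instance, a vector of 100 equal weights is 0.2-regular, a vector with one dominant weight followed by 100 equal weights has 0.2-critical index 2, and the geometric sequence (8, 4, 2, 1) has infinite 0.5-critical index.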

A  simple  observation  from  [41]  is  that  if  the  critical  index  of  a  sequence  is  large  the 
sequence  must  contain  a  geometrically  decreasing  subsequence. 

Lemma 9.3.2. (Lemma 5.5 in [41]) Given a vector $w = (w^{(i)})_{i=1}^{n}$ such that $|w^{(1)}| \ge |w^{(2)}| \ge \ldots \ge |w^{(n)}|$, if the $\tau$-critical index of the vector $w$ is larger than $l$, then for any $1 \le i \le j \le l+1$,

$$|w^{(j)}| \le \sigma_j \le \big(\sqrt{1-\tau^2}\big)^{j-i} \sigma_i \le \big(\sqrt{1-\tau^2}\big)^{j-i} |w^{(i)}|/\tau.$$

In particular, if $j > i + (4/\tau^2)\ln(1/\tau)$ then $|w^{(j)}| \le |w^{(i)}|/3$.

For a $\tau$-regular weight vector, the following lemma bounds the probability that its weighted sum falls into a small interval under certain distributions on the points. The proof is in Appendix 9.8.

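Lemma 9.3.3. Let $w \in \mathbb{R}^n$ be a $\tau$-regular vector with $\sum_i |w^{(i)}|^2 = 1$, and let $\mathcal{D}$ be a distribution over $\{-1,1\}^n$. Define a distribution $\tilde{\mathcal{D}}$ on $\{-1,1\}^n$ as follows: to generate $y$ from $\tilde{\mathcal{D}}$, first sample $x$ from $\mathcal{D}$ and then define

$$y^{(i)} = \begin{cases} x^{(i)} & \text{with probability } 1-\gamma \\ \text{random bit} & \text{with probability } \gamma. \end{cases}$$

Then for any interval $[a,b]$, we have

$$\Pr\big[\langle w, y \rangle \in [a,b]\big] \le \frac{4|b-a|}{\sqrt{\gamma}} + \frac{4\tau}{\sqrt{\gamma}} + 2e^{-\frac{\gamma^2}{2\tau^2}}.$$

Intuitively, $\langle w, y \rangle$ is $\tau$-close to the Gaussian distribution if each $y^{(i)}$ is a random bit, and therefore we can bound the probability that $\langle w, y \rangle$ falls into the interval $[a,b]$. In the above lemma, each $y^{(i)}$ has probability $\gamma$ of being a random bit, so roughly a $\gamma$ fraction of the $y^{(i)}$ are random bits, and we can still bound the probability that $\langle w, y \rangle$ falls into the interval $[a,b]$.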

Definition 9.3.4. For a vector $w \in \mathbb{R}^n$, define the set of indices $S_t(w) \subseteq [n]$ as the set of indices containing the $t$ largest coordinates of $w$ by absolute value. Suppose its $\tau$-critical index is $c_\tau$; define the set of indices $C_\tau(w) = S_{c_\tau}(w)$. In other words, $C_\tau(w)$ is the set of indices whose deletion makes the vector $w$ $\tau$-regular.

Definition 9.3.5. For a vector $w \in \mathbb{R}^n$ and a subset of indices $S \subseteq [n]$, define the vector $\mathrm{Truncate}(w,S) \in \mathbb{R}^n$ as:

$$(\mathrm{Truncate}(w,S))^{(i)} = \begin{cases} w^{(i)} & \text{if } i \in S \\ 0 & \text{otherwise.} \end{cases}$$
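The two operations just defined are simple to state in code. This sketch uses 0-based indices and hypothetical function names `top_indices` (for $S_t(w)$) and `truncate`; ties between equal absolute values are broken by position, which the definitions leave unspecified.

```python
def top_indices(w, t):
    """S_t(w): indices (0-based) of the t largest coordinates of w by absolute value."""
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    return set(order[:t])


def truncate(w, S):
    """Truncate(w, S): keep w[i] for i in S and set every other coordinate to 0."""
    return [w[i] if i in S else 0 for i in range(len(w))]
```

Splitting a weight vector as `truncate(w, S)` plus the remainder is the decomposition used later in the soundness analysis, where the large coordinates are handled by conditioning and the remainder is regular.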


As  suggested  by  Lemma  9.3.2,  a  weight  vector  with  a  large  critical  index  has  a  geo¬ 
metrically  decreasing  subsequence.  The  following  two  lemmas  use  this  fact  to  bound  the 
probability  that  the  weighted  sum  of  a  geometrically  decreasing  sequence  of  weights  falls 
into  a  small  interval.  First,  we  restate  Claim  5.7  from  [41]  here. 

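Lemma 9.3.6. [Claim 5.7, [41]] Let $|w^{(1)}| \ge |w^{(2)}| \ge \ldots \ge |w^{(T)}| > 0$ be a sequence of numbers so that $|w^{(i+1)}| \le \frac{|w^{(i)}|}{3}$ for $1 \le i \le T-1$. Then for any interval $I = \big[a - \frac{|w^{(T)}|}{6}, a + \frac{|w^{(T)}|}{6}\big]$ of length $\frac{|w^{(T)}|}{3}$, there is at most one point $x \in \{0,1\}^T$ such that $\langle w, x \rangle \in I$.

Lemma 9.3.7. Let $|w^{(1)}| \ge |w^{(2)}| \ge \ldots \ge |w^{(T)}| > 0$ be a sequence of numbers so that $|w^{(i+1)}| \le \frac{|w^{(i)}|}{3}$ for $1 \le i \le T-1$. Let $\mathcal{D}$ be a distribution over $\{-1,1\}^T$. Define a distribution $\tilde{\mathcal{D}}$ on $\{-1,1\}^T$ as follows: to generate $y$ from $\tilde{\mathcal{D}}$, sample $x$ from $\mathcal{D}$ and set

$$y^{(i)} = \begin{cases} x^{(i)} & \text{with probability } 1-\gamma \\ \text{random bit} & \text{with probability } \gamma. \end{cases}$$

Then for any $\theta \in \mathbb{R}$ we have

$$\Pr\Big[\langle w, y \rangle \in \Big[\theta - \frac{|w^{(T)}|}{6}, \theta + \frac{|w^{(T)}|}{6}\Big]\Big] \le \Big(1 - \frac{\gamma}{2}\Big)^T.$$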
Proof. By Lemma 9.3.6, we know that for the interval $J = \big[\theta - \frac{|w^{(T)}|}{6}, \theta + \frac{|w^{(T)}|}{6}\big]$ there is at most one point $r \in \{-1,1\}^T$ such that $\langle w, r \rangle \in J$. If no such $r$ exists then clearly the probability is zero. On the other hand, suppose there exists such an $r$; then $\langle w, y \rangle \in J$ only if $(y^{(1)},\ldots,y^{(T)}) = (r^{(1)},\ldots,r^{(T)})$ holds.

Conditioned on any fixing of the bits $x$, every bit $y^{(i)}$ is an independent random bit with probability $\gamma$. Therefore, for every fixing of $x$ and for each $i \in [T]$, with probability at least $\gamma/2$, $y^{(i)}$ is not equal to $r^{(i)}$. Therefore, $\Pr[y^{(1)} = r^{(1)}, y^{(2)} = r^{(2)}, \ldots, y^{(T)} = r^{(T)}] \le \big(1 - \frac{\gamma}{2}\big)^T$. □


9.3.2  Invariance  Principle 


While  invariance  principles  have  been  shown  in  various  settings  by  [31,  114,  117],  we  re¬ 
state  a  version  of  the  principle  well  suited  for  our  application.  We  present  a  self-contained 
proof  for  it  in  Appendix  9.9. 

Definition 9.3.8. A $C^4$-function $\Psi(x) : \mathbb{R} \to \mathbb{R}$ is said to be $B$-nice if $|\Psi''''(t)| \le B$ for all $t \in \mathbb{R}$.

Definition 9.3.9. Two ensembles of random variables $\mathcal{P} = (p_1,\ldots,p_k)$ and $\mathcal{Q} = (q_1,\ldots,q_k)$ are said to have matching moments up to degree $d$ if for every multi-set $S$ of elements from $[k]$ with $|S| \le d$, we have $\mathbb{E}\big[\prod_{i \in S} p_i\big] = \mathbb{E}\big[\prod_{i \in S} q_i\big]$.
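For small finite distributions, Definition 9.3.9 can be checked exhaustively. The sketch below represents a distribution as a dictionary from points to probabilities and uses exact rational arithmetic; the names `moment` and `matching_degree` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def moment(dist, S):
    """E[prod_{i in S} x_i] for dist given as {point (tuple): probability}."""
    total = Fraction(0)
    for x, p in dist.items():
        term = Fraction(p)
        for i in S:
            term *= x[i]
        total += term
    return total


def matching_degree(dist_p, dist_q, k, max_degree):
    """Largest d <= max_degree such that all moments over multi-sets S of
    elements of {0, ..., k-1} with |S| <= d agree."""
    for d in range(1, max_degree + 1):
        for S in combinations_with_replacement(range(k), d):
            if moment(dist_p, S) != moment(dist_q, S):
                return d - 1
    return max_degree


half = Fraction(1, 2)
P = {(0, 0): half, (1, 1): half}  # two perfectly correlated bits
Q = {(0, 1): half, (1, 0): half}  # two perfectly anti-correlated bits
```

The toy distributions `P` and `Q` agree on all first moments but differ on $\mathbb{E}[x_1 x_2]$, so they match only up to degree 1.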

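Theorem 9.3.10. (Invariance Principle) Let $\mathcal{A} = \{A^{\{1\}},\ldots,A^{\{R\}}\}$ and $\mathcal{B} = \{B^{\{1\}},\ldots,B^{\{R\}}\}$ be families of ensembles of random variables with $A^{\{i\}} = \{a_1^{(i)},\ldots,a_{k_i}^{(i)}\}$ and $B^{\{i\}} = \{b_1^{(i)},\ldots,b_{k_i}^{(i)}\}$, satisfying the following properties:

• For each $i \in [R]$, the random variables in the ensembles $(A^{\{i\}}, B^{\{i\}})$ have matching moments up to degree 3. Further, all the random variables in $\mathcal{A}$ and $\mathcal{B}$ are bounded by 1.

• The ensembles $A^{\{i\}}$ are all independent of each other; similarly, the ensembles $B^{\{i\}}$ are independent of each other.

Given a set of vectors $l = \{l^{\{1\}},\ldots,l^{\{R\}}\}$ ($l^{\{i\}} \in \mathbb{R}^{k_i}$), define the linear function $l : \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_R} \to \mathbb{R}$ as

$$l(x) = \sum_{i \in [R]} \langle l^{\{i\}}, x^{\{i\}} \rangle.$$

Then for a $B$-nice function $\Psi : \mathbb{R} \to \mathbb{R}$ we have

$$\Big| \mathbb{E}_{\mathcal{A}}\big[\Psi(l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\Psi(l(\mathcal{B}) - \theta)\big] \Big| \le B \sum_{i \in [R]} \|l^{\{i\}}\|_1^4$$

for all $\theta > 0$. Further, define the spread function $c(\alpha)$ corresponding to the ensembles $\mathcal{A}, \mathcal{B}$ and the linear function $l$ as follows:

(Spread Function) For $1/2 > \alpha > 0$, let

$$c(\alpha) = \max\Big( \sup_{\theta} \Pr_{\mathcal{A}}\big[l(\mathcal{A}) \in [\theta-\alpha, \theta+\alpha]\big],\ \sup_{\theta} \Pr_{\mathcal{B}}\big[l(\mathcal{B}) \in [\theta-\alpha, \theta+\alpha]\big] \Big);$$

then for all $\theta$,

$$\Big| \mathbb{E}_{\mathcal{A}}\big[\mathrm{sgn}(l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\mathrm{sgn}(l(\mathcal{B}) - \theta)\big] \Big| \le O\Big(\frac{1}{\alpha^4}\Big) \sum_{i \in [R]} \|l^{\{i\}}\|_1^4 + 2c(\alpha).$$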
9.4 Construction of the Dictatorship Test


In this section we describe the construction of the dictatorship test, which will be the key ingredient in the hardness reduction from $k$-Unique Label-Cover.

9.4.1 Distributions $\mathcal{D}_0$ and $\mathcal{D}_1$

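The dictatorship test is based on the following two distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ defined on $\{0,1\}^k$.

Lemma 9.4.1. For $k \in \mathbb{N}$, there exist two probability distributions $\mathcal{D}_0, \mathcal{D}_1$ on $\{0,1\}^k$ such that $\Pr_{x \sim \mathcal{D}_0}[\text{every } x_i \text{ is } 0] \ge 1 - \frac{2}{\sqrt{k}}$ and $\Pr_{x \sim \mathcal{D}_1}[\text{every } x_i \text{ is } 0] \le \frac{1}{\sqrt{k}}$, while the two distributions have matching moments up to degree 4, i.e., for all $a,b,c,d \in [k]$,

$$\mathbb{E}_{\mathcal{D}_0}[x_a] = \mathbb{E}_{\mathcal{D}_1}[x_a], \qquad \mathbb{E}_{\mathcal{D}_0}[x_a x_b] = \mathbb{E}_{\mathcal{D}_1}[x_a x_b],$$

$$\mathbb{E}_{\mathcal{D}_0}[x_a x_b x_c] = \mathbb{E}_{\mathcal{D}_1}[x_a x_b x_c], \qquad \mathbb{E}_{\mathcal{D}_0}[x_a x_b x_c x_d] = \mathbb{E}_{\mathcal{D}_1}[x_a x_b x_c x_d].$$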

Proof. For $\epsilon = \frac{1}{\sqrt{k}}$, take $\mathcal{D}_1$ to be the following distribution:

1. with probability $1-\epsilon$, randomly set exactly one of the bits to be 1 and all the others to be 0;

2. with probability $\frac{\epsilon}{4}$, independently set every bit to be 1 with probability $\frac{1}{k^{1/3}}$;

3. with probability $\frac{\epsilon}{4}$, independently set every bit to be 1 with probability $\frac{2}{k^{1/3}}$;

4. with probability $\frac{\epsilon}{4}$, independently set every bit to be 1 with probability $\frac{3}{k^{1/3}}$;

5. with probability $\frac{\epsilon}{4}$, independently set every bit to be 1 with probability $\frac{4}{k^{1/3}}$.

The distribution $\mathcal{D}_0$ is defined to be the following distribution, with parameters $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4$ to be specified later:

1. with probability $1 - (\epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4)$, set every bit to be zero;

2. with probability $\epsilon_1$, independently set every bit to be 1 with probability $\frac{1}{k^{1/3}}$;

3. with probability $\epsilon_2$, independently set every bit to be 1 with probability $\frac{2}{k^{1/3}}$;

4. with probability $\epsilon_3$, independently set every bit to be 1 with probability $\frac{3}{k^{1/3}}$;

5. with probability $\epsilon_4$, independently set every bit to be 1 with probability $\frac{4}{k^{1/3}}$.

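From the definition of $\mathcal{D}_0, \mathcal{D}_1$, we know that $\Pr_{x \sim \mathcal{D}_0}[\text{every } x_i \text{ is } 0] \ge 1 - (\epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4)$ and $\Pr_{x \sim \mathcal{D}_1}[\text{every } x_i \text{ is } 0] \le \epsilon = \frac{1}{\sqrt{k}}$.

It remains to determine each $\epsilon_i$. Notice that the moment matching conditions can be expressed as a linear system over the parameters $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4$ as follows:

$$\sum_{i=1}^{4} \epsilon_i \Big(\frac{i}{k^{1/3}}\Big) = \frac{1-\epsilon}{k} + \sum_{i=1}^{4} \frac{\epsilon}{4} \Big(\frac{i}{k^{1/3}}\Big)$$

$$\sum_{i=1}^{4} \epsilon_i \Big(\frac{i}{k^{1/3}}\Big)^2 = \sum_{i=1}^{4} \frac{\epsilon}{4} \Big(\frac{i}{k^{1/3}}\Big)^2$$

$$\sum_{i=1}^{4} \epsilon_i \Big(\frac{i}{k^{1/3}}\Big)^3 = \sum_{i=1}^{4} \frac{\epsilon}{4} \Big(\frac{i}{k^{1/3}}\Big)^3$$

$$\sum_{i=1}^{4} \epsilon_i \Big(\frac{i}{k^{1/3}}\Big)^4 = \sum_{i=1}^{4} \frac{\epsilon}{4} \Big(\frac{i}{k^{1/3}}\Big)^4.$$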
We then show that such a linear system has a feasible solution with $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4 \ge 0$ and $\sum_{i=1}^{4} \epsilon_i \le 2/\sqrt{k}$.

To prove this, we solve the system by Cramer's rule. With some calculation using basic linear algebra, we get

$$\epsilon_1 = \epsilon/4 + o(\epsilon).$$

For big enough $k$, we have $0 \le \epsilon_1 \le \frac{1}{2\sqrt{k}}$. By a similar calculation, we can bound $\epsilon_2, \epsilon_3, \epsilon_4$ by $\frac{1}{2\sqrt{k}}$. Overall, we have $\epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4 \le 2/\sqrt{k}$.

□
□ 
We define a "noisy" version of $\mathcal{D}_b$ ($b \in \{0,1\}$) below.

Definition 9.4.2. For $b \in \{0,1\}$, define the distribution $\tilde{\mathcal{D}}_b$ on $\{0,1\}^k$ as follows:

• First generate $x \in \{0,1\}^k$ according to $\mathcal{D}_b$.

• For each $i \in [k]$, set

$$y_i = \begin{cases} x_i & \text{with probability } 1-p \\ \text{uniform random bit } u_i & \text{with probability } p. \end{cases}$$

Observation 9.4.3. Since the noise is defined to be an independent uniform random bit, when calculating moments of $y$, such as $\mathbb{E}_{\tilde{\mathcal{D}}_b}[y_{i_1} y_{i_2} \cdots y_{i_d}]$, we can substitute each $y_i$ by $(1-p)x_i + p\,u_i$ and take the expectation over the independent bits $u_i$. Therefore, a degree $d$ moment of $y$ can be expressed as a weighted sum of moments of $x$ of degree up to $d$. Since $\mathcal{D}_0$ and $\mathcal{D}_1$ have matching moments up to degree 4, $\tilde{\mathcal{D}}_0$ and $\tilde{\mathcal{D}}_1$ also have matching moments up to degree 4.


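The following simple lemma asserts that conditioning the two distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ on the same coordinate $x_j$ being fixed to a value $c$ results in conditional distributions that still have matching moments up to degree 3.

Lemma 9.4.4. Given two distributions $\mathcal{D}_0, \mathcal{D}_1$ on $\{0,1\}^k$ with matching moments up to degree $d$, for any multi-set $S$ of elements from $[k]$ with $|S| \le d-1$, any $j \in [k]$ and any $c \in \{0,1\}$,

$$\mathbb{E}_{\mathcal{D}_0}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = c\Big] = \mathbb{E}_{\mathcal{D}_1}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = c\Big].$$

Proof. For the case $c = 1$ and any $b \in \{0,1\}$,

$$\mathbb{E}_{\mathcal{D}_b}\Big[x_j \prod_{i \in S} x_i\Big] = \mathbb{E}_{\mathcal{D}_b}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = 1\Big] \Pr_{\mathcal{D}_b}[x_j = 1] = \mathbb{E}_{\mathcal{D}_b}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = 1\Big]\, \mathbb{E}_{\mathcal{D}_b}[x_j].$$

Therefore,

$$\mathbb{E}_{\mathcal{D}_0}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = 1\Big] = \frac{\mathbb{E}_{\mathcal{D}_0}\big[x_j \prod_{i \in S} x_i\big]}{\mathbb{E}_{\mathcal{D}_0}[x_j]} = \frac{\mathbb{E}_{\mathcal{D}_1}\big[x_j \prod_{i \in S} x_i\big]}{\mathbb{E}_{\mathcal{D}_1}[x_j]} = \mathbb{E}_{\mathcal{D}_1}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = 1\Big],$$

where the middle equality uses that both the numerator (a moment of degree at most $d$) and the denominator (a first moment) match.

For the case $c = 0$, replace $x_j$ with $x'_j = 1 - x_j$. It is easy to see that $\mathcal{D}_0, \mathcal{D}_1$ still have matching moments, and conditioning on $x_j = 0$ is the same as conditioning on $x'_j = 1$. Hence we can reduce to the case $c = 1$. □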


9.4.2  The  Dictatorship  Test 

Let $R$ be a positive integer. Based on the distributions $\mathcal{D}_0$ and $\mathcal{D}_1$, we define the dictatorship test as follows:

1. Generate a random bit $b \in \{0,1\}$.

2. Generate $x \in \{0,1\}^{kR}$ from $\mathcal{D}_b^R$.

3. For each $i \in [k]$, $j \in [R]$, set

$$y_i^{(j)} = \begin{cases} x_i^{(j)} & \text{with probability } 1-p; \\ \text{random bit} & \text{with probability } p. \end{cases}$$

4. Output the pair $(y, b)$. Equivalently, ACCEPT if $h(y) = b$.

We can also view $y$ as being generated as follows: i) with probability $\frac{1}{2}$, generate a negative sample from the distribution $\tilde{\mathcal{D}}_0^R$; ii) with probability $\frac{1}{2}$, generate a positive sample from the distribution $\tilde{\mathcal{D}}_1^R$.
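The four steps of the test can be sketched as a sampler. In this sketch the distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ are passed in as callables, and the two toy samplers `all_zeros` and `one_hot` are hypothetical stand-ins that do not have matching moments; they only illustrate the completeness behaviour of a column disjunction. All names are assumptions made for illustration.

```python
import random


def sample_test_example(sample_d0, sample_d1, R, noise, rng):
    """One labeled example (y, b) of the dictatorship test.

    sample_d0, sample_d1: callables rng -> list of k bits (stand-ins for D_0, D_1)
    noise: probability with which each bit is replaced by a uniform random bit
    Returns y as a list of R columns (each a list of k bits) and the bit b.
    """
    b = rng.randrange(2)                      # step 1: random bit b
    sampler = sample_d1 if b == 1 else sample_d0
    y = []
    for _ in range(R):                        # step 2: R independent columns from D_b
        column = sampler(rng)
        # Step 3: rerandomize each bit independently with probability `noise`.
        y.append([rng.randrange(2) if rng.random() < noise else bit
                  for bit in column])
    return y, b                               # step 4: output (y, b)


def or_of_column(y, j):
    """The disjunction OR_j: OR of the k bits in column j."""
    return int(any(y[j]))


def acceptance_rate(h, sample_d0, sample_d1, R, noise, trials, seed=0):
    """Empirical probability that h(y) == b over `trials` samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y, b = sample_test_example(sample_d0, sample_d1, R, noise, rng)
        hits += int(h(y) == b)
    return hits / trials


def all_zeros(k):
    """Toy stand-in for D_0: always the all-zero column."""
    return lambda rng: [0] * k


def one_hot(k):
    """Toy stand-in for D_1: exactly one uniformly chosen bit set to 1."""
    def sample(rng):
        column = [0] * k
        column[rng.randrange(k)] = 1
        return column
    return sample
```

With the toy samplers and no noise, every column disjunction is accepted on every sample; adding a small amount of noise lowers the acceptance rate only slightly.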

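Theorem 9.4.5. (Completeness) For any $j \in [R]$, $h(y) = \bigvee_{i=1}^{k} y_i^{(j)}$ passes with probability at least $1 - \frac{2}{\sqrt{k}} - kp$.

Proof. If $x$ is generated from $\mathcal{D}_0^R$, we know that with probability at least $1 - \frac{2}{\sqrt{k}}$, all the bits in $\{x_1^{(j)}, x_2^{(j)}, \ldots, x_k^{(j)}\}$ are set to 0. By a union bound over the noise on these $k$ bits, with probability at least $1 - \frac{2}{\sqrt{k}} - kp$, the bits $\{y_1^{(j)}, y_2^{(j)}, \ldots, y_k^{(j)}\}$ are all set to 0, in which case the test passes as $\bigvee_{i=1}^{k} y_i^{(j)} = 0$. If $x$ is generated from $\mathcal{D}_1^R$, we know that with probability at least $1 - \frac{1}{\sqrt{k}}$, one of the bits in $\{x_1^{(j)}, x_2^{(j)}, \ldots, x_k^{(j)}\}$ is set to 1, and by a union bound one of $\{y_1^{(j)}, y_2^{(j)}, \ldots, y_k^{(j)}\}$ is set to 1 with probability at least $1 - \frac{1}{\sqrt{k}} - kp$, in which case the test passes since $\bigvee_{i=1}^{k} y_i^{(j)} = 1$. Overall, the test passes with probability at least $1 - \frac{2}{\sqrt{k}} - kp$. □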


9.4.3  Soundness  Analysis 


The soundness property of the test (formally stated in Theorem 9.4.8) is that if some halfspace $h(y)$ passes the above dictatorship test with high probability, then we can decode each $w_i$ ($i \in [k]$) into a small list, and at least two of the lists will intersect. The proof of the soundness property is based on two key lemmas (Lemmas 9.4.6 and 9.4.7). Roughly speaking, the first lemma states that if a halfspace passes the test with good probability, then two of its critical index sets $C_\tau(w_i), C_\tau(w_j)$ (see Definition 9.3.1) must intersect; the second lemma states that every halfspace can be approximated by another halfspace with a small critical index.

Let $h(y)$ be a halfspace function on $\{0,1\}^{kR}$ given by $h(y) = \mathrm{sgn}(\langle w, y \rangle - \theta)$. Equivalently, $h(y)$ can be written as

$$h(y) = \mathrm{sgn}\Big( \sum_{j \in [R]} \langle w^{(j)}, y^{(j)} \rangle - \theta \Big) = \mathrm{sgn}\Big( \sum_{i \in [k]} \langle w_i, y_i \rangle - \theta \Big),$$

where $w^{(j)} \in \mathbb{R}^k$ and $w_i \in \mathbb{R}^R$.

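Lemma 9.4.6. (Common Influential Coordinate) For $\tau = \frac{1}{k^7}$, let $h(y)$ be a halfspace such
that for all $i \neq j \in [k]$ we have $C_\tau(w_i) \cap C_\tau(w_j) = \emptyset$. Then

$$\Big| \mathbb{E}_{\mathcal{D}_0^R}[h(y)] - \mathbb{E}_{\mathcal{D}_1^R}[h(y)] \Big| \leq O\Big(\frac{1}{k}\Big).$$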

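Proof. Fix the following notation:

$$s_i = \mathrm{Truncate}(w_i, C_\tau(w_i)), \qquad l_i = w_i - s_i, \qquad y_i^L = \mathrm{Truncate}(y_i, C_\tau(w_i)),$$

$$s = s_1, \ldots, s_k, \qquad l = l_1, \ldots, l_k, \qquad y^L = y_1^L, \ldots, y_k^L.$$

We can rewrite the halfspace $h(y)$ as $h(y) = \mathrm{sgn}(\langle s, y^L \rangle + \langle l, y \rangle - \theta)$. Let us first normalize
the halfspace $h(y)$ so that $\sum_{i \in [k]} \|l_i\|_2^2 = 1$. We now condition on a possible fixing of the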




vector $y^L$. Under this conditioning and for $y$ chosen randomly from the distribution $\mathcal{D}_0^R$,
define the family of ensembles $\mathcal{A} = A^{[1]}, \ldots, A^{[R]}$ as follows:

$$A^{[j]} = \{ y_i^{(j)} \mid i \in [k] \text{ such that } j \notin C_\tau(w_i) \}.$$

Similarly define the ensemble $\mathcal{B} = B^{[1]}, \ldots, B^{[R]}$ using $y$ chosen randomly from the distribution $\mathcal{D}_1^R$. Further, let $l^{[j]}$ denote the coordinates of $l$ that correspond to the variables in $A^{[j]}$. Now we apply the invariance principle
(Theorem 9.3.10) to the ensembles $\mathcal{A}, \mathcal{B}$ and the linear function $l$. For each $j \in [R]$, there
is at most one coordinate $i \in [k]$ such that $j \in C_\tau(w_i)$. Thus, conditioning on $y^L$ amounts to
fixing at most one variable $y_i^{(j)}$ in each $\{y_i^{(j)}\}_{i \in [k]}$. By Lemma 9.4.4, since $\mathcal{D}_0$ and $\mathcal{D}_1$ have
matching moments up to degree 4, we get that $A^{[j]}$ and $B^{[j]}$ have matching moments up
to degree 3. Also notice that $\max_{j \in [R], i \in [k]} |l_i^{(j)}| \leq \tau \|l_i\|_2 \leq \tau \|l\|_2$ (as $l_i$ is $\tau$-regular) and
each $y_i^{(j)}$ is set to be a random unbiased bit with probability $\frac{1}{k^2}$; by Lemma 9.3.3, the linear
function $l$ and the ensembles $\mathcal{A}, \mathcal{B}$ satisfy the following spread property for every $\theta' \in \mathbb{R}$:


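$$\Pr_{\mathcal{A}}\big[ l(\mathcal{A}) \in [\theta' - \alpha, \theta' + \alpha] \big] \leq c(\alpha), \qquad \Pr_{\mathcal{B}}\big[ l(\mathcal{B}) \in [\theta' - \alpha, \theta' + \alpha] \big] \leq c(\alpha),$$

where $c(\alpha) \leq 8\alpha k + 4\tau k + 2e^{-\frac{1}{2k^4\tau^2}}$ (by setting $\gamma = \frac{1}{k^2}$ and $|b - a| = 2\alpha$ in Lemma 9.3.3). Using
the invariance principle (Theorem 9.3.10) this implies: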


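$$\Big| \mathbb{E}_{\mathcal{A}}\Big[ \mathrm{sgn}\Big( \langle s, y^L \rangle + \sum_{j \in [R]} \langle l^{[j]}, A^{[j]} \rangle - \theta \Big) \Big] - \mathbb{E}_{\mathcal{B}}\Big[ \mathrm{sgn}\Big( \langle s, y^L \rangle + \sum_{j \in [R]} \langle l^{[j]}, B^{[j]} \rangle - \theta \Big) \Big] \Big| \leq O\Big(\frac{1}{\alpha^4}\Big) \sum_{j \in [R]} \|l^{[j]}\|_1^4 + 2c(\alpha). \qquad (9.1)$$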


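By definition of the critical index, we have $\max_{j \in [R]} |l_i^{(j)}| \leq \tau \|l_i\|_2$. Using this, we can bound
$\sum_{j \in [R]} \|l^{[j]}\|_1^4$ as follows:

$$\sum_{j \in [R]} \|l^{[j]}\|_1^4 \leq k^4 \sum_{i \in [k]} \sum_{j \in [R]} (l_i^{(j)})^4 \leq k^4 \sum_{i \in [k]} \Big( \max_{j \in [R]} |l_i^{(j)}|^2 \Big) \|l_i\|_2^2 \leq k^4 \tau^2 \sum_{i \in [k]} \|l_i\|_2^4 \leq k^4 \tau^2 \|l\|_2^4 \leq \frac{1}{k^{10}}.$$

In the final step, we used the fact that $\tau = \frac{1}{k^7}$ and $\|l\|_2 = 1$ by normalization. Let us fix
$\alpha = \frac{1}{k^2}$. The inequality (9.1) holds for all settings of $y^L$. Averaging over all settings of $y^L$,
we get that (9.1) can be bounded by $O(\frac{1}{k})$. □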


The set $C_\tau(w_i)$ can be thought of as the set of influential coordinates of $w_i$. In this light,
the above lemma asserts that unless some two vectors $w_i, w_j$ have a common influential
coordinate, the halfspace $h(y)$ cannot distinguish between $\mathcal{D}_0^R$ and $\mathcal{D}_1^R$.

Unlike with the traditional notion of influence, it is unclear whether the number of
coordinates in $C_\tau(w_i)$ is small. The following lemma yields a way to get around this.
Lemma 9.4.7. (Bounding the number of influential coordinates) Fix


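Given a halfspace $h(y)$ and $\ell \in [k]$ such that $|C_\tau(w_\ell)| > t$, define $\tilde{h}(y) = \mathrm{sgn}\big( \sum_{i \in [k]} \langle \tilde{w}_i, y_i \rangle - \theta \big)$
as follows: $\tilde{w}_\ell = \mathrm{Truncate}(w_\ell, S_t(w_\ell))$ and $\tilde{w}_i = w_i$ for all $i \neq \ell$. Then,

$$\Big| \mathbb{E}_{\mathcal{D}_0^R}[\tilde{h}(y)] - \mathbb{E}_{\mathcal{D}_0^R}[h(y)] \Big| \leq \frac{1}{k^2}$$

and

$$\Big| \mathbb{E}_{\mathcal{D}_1^R}[\tilde{h}(y)] - \mathbb{E}_{\mathcal{D}_1^R}[h(y)] \Big| \leq \frac{1}{k^2}.$$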


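Proof. Without loss of generality, we assume $\ell = 1$ and $|w_1^{(1)}| \geq |w_1^{(2)}| \geq \cdots \geq |w_1^{(R)}|$. In
particular, this implies $S_t(w_1) = \{1, \ldots, t\}$. Set $T = 4k^2 \log(1/k)$. Define the subset $G$ of
$S_t(w_1)$ as

$$G = \{ g_i \mid g_i = 1 + i \lceil (4/\tau^2) \ln(1/\tau) \rceil, \ 0 \leq i \leq T \}.$$

Therefore, by Lemma 9.3.2, $|w_1^{(g_i)}|$ is a geometrically decreasing sequence such that $|w_1^{(g_{i+1})}| \leq |w_1^{(g_i)}|/3$. Let $H = S_t(w_1) \setminus G$. Fix the following notation:

$$w_1^G = \mathrm{Truncate}(w_1, G), \qquad w_1^H = \mathrm{Truncate}(w_1, H), \qquad w_1^{>t} = \mathrm{Truncate}(w_1, \{t+1, \ldots, n\}).$$

Similarly, define the vectors $y_1^G, y_1^H, y_1^{>t}$. We now rewrite the halfspace functions $h(y)$ and
$\tilde{h}(y)$ as:

$$h(y) = \mathrm{sgn}\Big( \sum_{i=2}^{k} \langle w_i, y_i \rangle + \langle w_1^G, y_1^G \rangle + \langle w_1^H, y_1^H \rangle + \langle w_1^{>t}, y_1^{>t} \rangle - \theta \Big),$$

$$\tilde{h}(y) = \mathrm{sgn}\Big( \sum_{i=2}^{k} \langle w_i, y_i \rangle + \langle w_1^G, y_1^G \rangle + \langle w_1^H, y_1^H \rangle - \theta \Big).$$

Notice that for any $y$, $h(y) \neq \tilde{h}(y)$ implies

$$\Big| \sum_{i=2}^{k} \langle w_i, y_i \rangle + \langle w_1^G, y_1^G \rangle + \langle w_1^H, y_1^H \rangle - \theta \Big| \leq \big| \langle w_1^{>t}, y_1^{>t} \rangle \big|. \qquad (9.2)$$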


By  Lemma  9.3.2,  we  know  that 


lu-'r’i2  ? 


(l-T2)^ 


U) 


>t  ii 2 


2^(log(l/r)+log.R) 


U) 


>t  ii  2 


iC 


10 


>f 


(1-T2) 


Using  the  fact  that  RWiv^W^  ^  Hio^llf,  we  can  get  that  ||m>^||i  ^  ^  \\w^t)\. 

Combining  the  above  inequality  with  (9.2)  we  see  that, 
where $\theta' = -\sum_{i=2}^{k} \langle w_i, y_i \rangle - \langle w_1^H, y_1^H \rangle + \theta$. Any fixing of the value of $\theta' \in \mathbb{R}$ induces a
certain distribution on $y_1^G$. However, the $\frac{1}{k^2}$ noise introduced in $y_1^G$ is completely independent.
This corresponds to the setting of Lemma 9.3.7, and hence we can bound the above
probability by $\big(1 - \frac{1}{2k^2}\big)^T \leq \frac{1}{k^2}$. The result follows from averaging over all values of $\theta'$. □

Theorem 9.4.8. (Soundness) Fix $\tau = \frac{1}{k^7}$ and let $t$ be set as in Lemma 9.4.7. Let
$h(y) = \mathrm{sgn}(\langle w, y \rangle - \theta)$ be a halfspace such that $S_t(w_i) \cap S_t(w_j) = \emptyset$ for all $i \neq j \in [k]$. Then the
halfspace $h(y)$ passes the dictatorship test with probability at most $\frac{1}{2} + O(\frac{1}{k})$.


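Proof. The probability of success of $h(y)$ is given by $\frac{1}{2} + \frac{1}{2}\big( \mathbb{E}_{\mathcal{D}_0^R}[h(y)] - \mathbb{E}_{\mathcal{D}_1^R}[h(y)] \big)$. Therefore, it suffices to show that

$$\Big| \mathbb{E}_{\mathcal{D}_0^R}[h(y)] - \mathbb{E}_{\mathcal{D}_1^R}[h(y)] \Big| \leq O\Big(\frac{1}{k}\Big).$$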


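Define $K = \{ \ell \mid C_\tau(w_\ell) > t \}$. We discuss the following two cases.

1. $K = \emptyset$; i.e., for all $i \in [k]$, $C_\tau(w_i) \leq t$. Then for all $i \neq j$, $S_t(w_i) \cap S_t(w_j) = \emptyset$ implies $C_\tau(w_i) \cap C_\tau(w_j) = \emptyset$. By Lemma 9.4.6, we thus have

$$\Big| \mathbb{E}_{\mathcal{D}_0^R}[h(y)] - \mathbb{E}_{\mathcal{D}_1^R}[h(y)] \Big| \leq O\Big(\frac{1}{k}\Big).$$

2. $K \neq \emptyset$. Then for all $\ell \in K$, we set $\tilde{w}_\ell = \mathrm{Truncate}(w_\ell, S_t(w_\ell))$ and replace $w_\ell$ with $\tilde{w}_\ell$ in $h$
to get a new halfspace $h'$. Since such replacements occur at most $k$ times and, by Lemma
9.4.7, every replacement changes the output of the halfspace on at most a $\frac{1}{k^2}$ fraction of
examples, we can bound the overall change by $k \times \frac{1}{k^2} = \frac{1}{k}$. That is,

$$\Big| \mathbb{E}_{\mathcal{D}_0^R}[h'(y)] - \mathbb{E}_{\mathcal{D}_0^R}[h(y)] \Big| \leq \frac{1}{k}, \qquad \Big| \mathbb{E}_{\mathcal{D}_1^R}[h'(y)] - \mathbb{E}_{\mathcal{D}_1^R}[h(y)] \Big| \leq \frac{1}{k}. \qquad (9.3)$$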


Also notice that for $h'$ and all $\ell \in [k]$, the critical index $|C_\tau(\tilde{w}_\ell)|$ is less than $t$. This
reduces the problem to Case 1, and we conclude $\big| \mathbb{E}_{\mathcal{D}_0^R}[h'(y)] - \mathbb{E}_{\mathcal{D}_1^R}[h'(y)] \big| = O(1/k)$. Along
with (9.3) this finishes the proof.


□ 


9.5 Reduction from k-Unique Label-Cover

In this section, we briefly describe a reduction from the k-UNIQUE LABEL-COVER problem
to agnostic learning of monomials, thus showing Theorem 9.1.1 under the Unique Games
Conjecture (Conjecture 9.2.2). Although our final hardness result only assumes P ≠ NP,
we describe the reduction from k-UNIQUE LABEL-COVER for the purpose of illustrating the
main idea of our proof.

Let $\mathcal{L}(G(V,E), 1, R, \{\pi^{v,e} \mid v \in V, e \in E\})$ be an instance of k-UNIQUE LABEL-COVER. The
reduction will produce a distribution over labeled examples $(y, b)$, where $y$ lies in $\{0,1\}^{|V| \times R}$
and the label $b \in \{0,1\}$. We will index the coordinates of $y \in \{0,1\}^{|V| \times R}$ by $y_w^{(i)}$ (for $w \in V$, $i \in [R]$)
and denote by $y_w$ (for $w \in V$) the vector $(y_w^{(1)}, y_w^{(2)}, \ldots, y_w^{(R)})$.
1. Sample an edge $e = (v_1, \ldots, v_k) \in E$.

2. Generate a random bit $b \in \{-1, 1\}$.

3. Sample $x \in \{-1,1\}^{kR}$ from $\mathcal{D}_b^R$.

4. Define $y \in \{-1,1\}^{|V| \times R}$ as follows:

(a) For each $v \notin \{v_1, \ldots, v_k\}$, $y_v = 0$.

(b) For each $i \in [k]$ and $j \in [R]$, $y_{v_i}^{(j)} = x_i^{(\pi^{v_i,e}(j))}$.

5. Output the example $(y, b)$.


Proof  of  Theorem  9.1.1  assuming  Unique  Games  Conjecture  Fix  k  =  p  = 

and  a  positive  integer  R  >  f(2^29) v2 1  for  which  Conjecture  9.2.2  holds. 

Completeness: Suppose that $\mathcal{A} : V \to [R]$ is a labeling that strongly satisfies a $1 - k\eta$
fraction of the edges. Consider the disjunction $h(y) = \vee_{v \in V} y_v^{(\mathcal{A}(v))}$. For at least a $1 - k\eta$ fraction
of edges $e = (v_1, v_2, \ldots, v_k) \in E$, $\pi^{v_1,e}(\mathcal{A}(v_1)) = \cdots = \pi^{v_k,e}(\mathcal{A}(v_k)) = \ell$. As all coordinates
of $y$ outside of $\{y_{v_1}, \ldots, y_{v_k}\}$ are set to 0 in step 4(a), the disjunction reduces to
$\vee_{i \in [k]} y_{v_i}^{(\mathcal{A}(v_i))} = \vee_{i \in [k]} x_i^{(\ell)}$. By Theorem 9.4.5, such a disjunction agrees with every $(y, b)$ with
probability at least $1 - \frac{2}{\sqrt{k}}$. Therefore $h(y)$ agrees with a random example with probability
at least $(1 - \frac{2}{\sqrt{k}})(1 - k\eta) \geq 1 - \frac{2}{\sqrt{k}} - k\eta \geq 1 - \epsilon$.

Soundness:  Suppose  there  exists  a  halfspace  h(y)  =  T.vev(wv,yv)  that  agrees  with  more 
than  ^  +  e  ^  \  +  -^=  fraction  of  the  examples.  Set  t  -  £12(31og(&6)  +  logi?)  +  4&2log(l/£))  = 

O  [k 13  logfff ))  (same  as  in  Theorem  9.4.8).  Define  the  labeling  sd  using  the  following  strat¬ 
egy  :  for  each  vertex  v  e  V  randomly  pick  a  label  from  St(wv). 

By an averaging argument, for at least an $\frac{\epsilon}{2}$ fraction of the edges $e \in E$ generated in step
1 of the reduction, $h(y)$ agrees with the examples corresponding to $e$ with probability at
least $\frac{1}{2} + \frac{\epsilon}{2}$. We will refer to such edges as good. By Theorem 9.4.8, for each good edge $e \in E$,
there exist $i, j \in [k]$ such that $\pi^{v_i,e}(S_t(w_{v_i})) \cap \pi^{v_j,e}(S_t(w_{v_j})) \neq \emptyset$. Therefore the edge $e \in E$
is weakly satisfied by the labeling $\mathcal{A}$ with probability at least $\frac{1}{t^2}$. Hence, in expectation
the  labelling  si  weakly  satisfies  at  least  |  •  ^  =  Q(fe27l^g2fl)  ^  jpjrj  fraction  of  the  edges  (by 
the  choice  of  R  and  t ). 
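
The decoding strategy used in this soundness argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: we read $S_t(w)$ as the set of the $t$ largest-magnitude coordinates of $w$ (its formal definition is given earlier in the chapter), and the helper names `top_t_coordinates`, `random_labeling` and `weakly_satisfied` are hypothetical, not part of the original text.

```python
import random

def top_t_coordinates(w, t):
    """Indices of the t largest-magnitude entries of w (our reading of S_t(w))."""
    order = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    return order[:t]

def random_labeling(weights, t, rng):
    """For every vertex v, pick a uniformly random label from S_t(w_v)."""
    return {v: rng.choice(top_t_coordinates(w, t)) for v, w in weights.items()}

def weakly_satisfied(edge, projections, labeling):
    """True if at least two vertices of the edge project to the same label."""
    images = [projections[v][labeling[v]] for v in edge]
    return len(set(images)) < len(images)
```

If the projected lists of two vertices of an edge share a label, each of the two vertices picks a matching preimage with probability at least $1/t$, which is where the $1/t^2$ factor in the argument comes from.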


9.6  Reduction  from  Label  Cover 


In  this  section,  we  describe  a  reduction  from  the  bipartite  Label  Cover  problem  to  a  k- 
Label-Cover  with  an  additional  smoothness  property.  We  then  reduce  the  smooth  k- 
Label-Cover  problem  to  agnostic  learning  of  disjunctions  by  halfspaces.  This  will  give 
us  Theorem  9.1.1  without  assuming  the  Unique  Games  Conjecture. 
9.6.1 Smooth k-Label-Cover


Our reduction uses the following hardness result for k-Label-Cover (Definition 9.2.1)
with an additional smoothness property.

Theorem 9.6.1. There exists a constant $\gamma > 0$ such that for any integer parameters $J, u \geq 1$,
it is NP-hard to distinguish between the following two types of k-Label-Cover instances $\mathcal{L}(G(V,E), N, M, \{\pi^{v,e} \mid e \in E, v \in e\})$ with $M = 7^{(J+1)u}$ and $N = 2^u 7^{Ju}$:

1. (Strongly satisfiable instances) There is some labeling that strongly satisfies every
hyperedge.

2. (Instances that are not $2k^2 2^{-\gamma u}$-weakly satisfiable) There is no labeling that weakly
satisfies at least a $2k^2 2^{-\gamma u}$ fraction of the hyperedges.

In addition, the k-Label-Cover instances have the following properties:

• (Smoothness) For a fixed vertex $w$ and a randomly picked hyperedge $e$ containing $w$,

$$\forall i \neq j \in [M], \quad \Pr[\pi^{w,e}(i) = \pi^{w,e}(j)] \leq 1/J.$$

• For any mapping $\pi^{e,v}$ and any number $i \in [N]$, we have $|(\pi^{e,v})^{-1}(i)| \leq d = 4^u$; i.e.,
there are at most $d = 4^u$ elements in $[M]$ that are mapped to the same number in $[N]$.

The proof of the above theorem can be found in Appendix 9.10.

In the rest of the thesis, we will set $u = k$ and therefore $d = 4^k$. We also set the smoothness
parameter $J = d^{17} = 4^{17k}$.

9.6.2 Reduction from Smooth k-Label-Cover

The starting point is a smooth k-LABEL-COVER instance $\mathcal{L}(G(V,E), N, M, \{\pi^{v,e} \mid e \in E, v \in e\})$ with
$M = 7^{(J+1)u}$ and $N = 2^u 7^{Ju}$ as described in Theorem 9.6.1. Below is the reduction
that, given an instance $\mathcal{L}$ of k-LABEL-COVER, produces a random labeled example. We refer
to the obtained distribution on examples as $\mathcal{E}$.

• Pick a hyperedge $e = (v_1, v_2, \ldots, v_k) \in E$ with corresponding projections $\pi_1, \ldots, \pi_k : [M] \to [N]$.

• Generate a random bit $b \in \{-1, 1\}$.

• Sample $x \in \{-1,1\}^{kN}$ from $\mathcal{D}_b^N$.

• Generate $y \in \{-1,1\}^{|V| \times M}$ as follows:

1. For each $v \notin e$, $y_v = 0$.

2. For each $i \in [k]$, set $y_{v_i} \in \{-1,1\}^M$ as follows: for each $j \in [M]$, $y_{v_i}^{(j)} = x_i^{(\pi_i(j))}$ with probability $1 - \frac{1}{k^2}$, and $y_{v_i}^{(j)}$ is a random bit with probability $\frac{1}{k^2}$.

• Output the example $(y, b)$, or equivalently ACCEPT if $h(y) = b$.
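
To make the sampling procedure concrete, here is a minimal Python sketch of one draw from the example distribution. It rests on several assumptions that are not taken from the text: the distribution of $x$ is supplied by the caller as a hypothetical callable `sample_x`, the per-coordinate noise rate is taken to be $1/k^2$, vertices are integer ids, and the data layout of `edges` and `projections` is our own choice.

```python
import random

def sample_example(edges, projections, num_vertices, M, k, sample_x, rng):
    """Draw one labeled example (y, b) following the bulleted steps above.

    edges:       list of hyperedges, each a tuple of k vertex ids
    projections: projections[(v, e_index)] is a list of length M with values in range(N)
    sample_x:    hypothetical callable (b, rng) -> k lists of N bits; it stands in
                 for the distribution of x defined earlier in the chapter
    """
    e_index = rng.randrange(len(edges))       # pick a hyperedge
    edge = edges[e_index]
    b = rng.choice([-1, 1])                   # random label bit
    x = sample_x(b, rng)                      # x depends on b
    noise = 1.0 / k ** 2                      # assumed noise rate
    y = [[0] * M for _ in range(num_vertices)]  # y_v = 0 outside the hyperedge
    for i, v in enumerate(edge):
        pi = projections[(v, e_index)]
        for j in range(M):
            if rng.random() < noise:
                y[v][j] = rng.choice([-1, 1])  # rerandomised coordinate
            else:
                y[v][j] = x[i][pi[j]]          # copy of the projected bit of x
    return y, b
```

A learner's hypothesis $h$ would then be scored by how often $h(y) = b$ over repeated calls.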
9.6.3 Proof of Theorem 9.1.1


We claim that our reduction has the following completeness and soundness properties.

Theorem 9.6.2.

• COMPLETENESS: If $\mathcal{L}$ is a strongly satisfiable instance of smooth
k-Label-Cover, then there is a disjunction that agrees with a random example from
$\mathcal{E}$ with probability at least $1 - O(\frac{1}{\sqrt{k}})$.

• SOUNDNESS: If $\mathcal{L}$ is not $2k^2 2^{-\gamma k}$-weakly satisfiable, then there is no halfspace that
agrees with a random example from $\mathcal{E}$ with probability more than $\frac{1}{2} + O(\frac{1}{\sqrt{k}})$.

Combining the above theorem with Theorem 9.6.1 for $k = O(1/\epsilon^2)$, we obtain
our main result, Theorem 9.1.1.

It remains to check the correctness of the completeness and soundness claims in
Theorem 9.6.2. First let us prove the completeness property.

Proof. (Proof of Completeness) Let $L$ be the labeling that strongly satisfies $\mathcal{L}$. Consider the
disjunction $h(y) = \vee_{v \in V} y_v^{(L(v))}$. Let $e = (v_1, v_2, \ldots, v_k)$ be any hyperedge and let $\mathcal{E}_e$ be the
distribution $\mathcal{E}$ restricted to the examples generated for $e$. With probability at least $1 - 1/k$,
$y_{v_i}^{(L(v_i))} = x_i^{(\pi^{e,v_i}(L(v_i)))}$ for every $i \in [k]$. As $e$ is strongly satisfied by $L$, for all $i, j \in [k]$,
$\pi^{e,v_i}(L(v_i)) = \pi^{e,v_j}(L(v_j))$. Therefore, as in the proof of Theorem 9.4.5, we obtain that $h(y)$
agrees with a random example from $\mathcal{E}_e$ with probability at least $1 - O(1/\sqrt{k})$. The labeling
$L$ strongly satisfies all edges and therefore we obtain that $h(y)$ agrees with a random
example from $\mathcal{E}$ with probability at least $1 - O(1/\sqrt{k})$. □

The  more  complicated  part  is  the  soundness  property  which  we  prove  in  Section  9.6.4. 

9.6.4  Soundness  Analysis 

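Let $h(y)$ be a halfspace that agrees with more than a $\frac{1}{2} + \frac{1}{\sqrt{k}}$ fraction of the examples. Suppose

$$h(y) = \mathrm{sgn}\Big( \sum_{v \in V} \langle w_v, y_v \rangle - \theta \Big).$$

Let $\tau = \frac{1}{k^{13}}$ and let

$$s_v = \mathrm{Truncate}(w_v, C_\tau(w_v)), \qquad l_v = w_v - s_v.$$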

Definition 9.6.3. A vertex $v \in V$ is said to be $\beta$-nice with respect to a hyperedge $e \in E$
containing it if

$$\sum_{i \in [N]} \Big( \sum_{j \in \pi^{-1}(i)} |l_v^{(j)}| \Big)^4 \leq \beta \|l_v\|_2^4,$$

where $\pi : [M] \to [N]$ is the projection associated with vertex $v$ and hyperedge $e$. In other
words,

$$\sum_{i \in [N]} \|l_v^{[i]}\|_1^4 \leq \beta \|l_v\|_2^4.$$

A hyperedge $e = (v_1, v_2, \ldots, v_k)$ is $\beta$-nice if for every $i \in [k]$ the vertex $v_i$ is $\beta$-nice with
respect to $e$.
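
As a small illustration of the niceness condition, the following Python sketch evaluates the ratio between the bucketed fourth-power sum and $\|l_v\|_2^4$ for a given vector and projection. The function names and the threshold name `beta` are hypothetical, and the normalisation by $\|l_v\|_2^4$ reflects our reading of Definition 9.6.3.

```python
def niceness_ratio(l, pi, N):
    """Return sum_i (sum_{j in pi^{-1}(i)} |l_j|)^4 divided by ||l||_2^4.

    l  : list of real weights, indexed by range(M)
    pi : list of length M with values in range(N) (the projection)
    """
    bucket = [0.0] * N
    for j, value in enumerate(l):
        bucket[pi[j]] += abs(value)          # l1 mass landing on label pi[j]
    numerator = sum(s ** 4 for s in bucket)
    norm_sq = sum(value * value for value in l)
    return numerator / norm_sq ** 2

def is_nice(l, pi, N, beta):
    """True if the vector meets the niceness condition with threshold beta."""
    return niceness_ratio(l, pi, N) <= beta
```

When the projection spreads the heavy coordinates over distinct labels the ratio stays small, whereas collapsing them onto a single label makes it large; this is the role the smoothness property plays in Lemma 9.6.4.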

Lemma 9.6.4. The fraction of $2\tau$-nice hyperedges in $E$ is at least $1 - O(1/k)$.

Proof. By definition, we know that $l_v$ is a $\tau$-regular vector. Denote $I_v = \{ i \in [M] : (l_v^{(i)})^2 / \|l_v\|_2^2 \geq 1/d^8 \}$. Therefore, $|I_v| \leq d^8$. Notice there are at most $d^{16}$ pairs of values in $I_v \times I_v$. By the smoothness
property of the k-Label-Cover instance, for any vertex $v$, at least a $1 - \frac{d^{16}}{J} = 1 - \frac{1}{d}$ fraction of the
hyperedges incident on $v$ have the following property: for any distinct $i, j \in I_v$, $\pi^{e,v}(i) \neq \pi^{e,v}(j)$. If all
the vertices in a hyperedge have this property we call it a good hyperedge. By an averaging
argument, we know that among all hyperedges at least a $1 - \frac{k}{d} = 1 - \frac{k}{4^k} \geq 1 - O(\frac{1}{k})$
fraction is good.

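We will show that all these good hyperedges are also $2\tau$-nice. For a given good hyperedge $e$,
a vertex $v \in e$, $\pi = \pi^{e,v}$ and $i \in [N]$, there is at most one $j \in \pi^{-1}(i)$ such that $\frac{(l_v^{(j)})^2}{\|l_v\|_2^2} \geq \frac{1}{d^8}$.
Based on the above property, we will show

$$\sum_{i \in [N]} \Big( \sum_{j \in \pi^{-1}(i)} |l_v^{(j)}| \Big)^4 \leq 2\tau \|l_v\|_2^4.$$

Notice that

$$\sum_{i \in [N]} \Big( \sum_{j \in \pi^{-1}(i)} |l_v^{(j)}| \Big)^4 = \sum_{i \in [N]} \ \sum_{j_1, j_2, j_3, j_4 \in \pi^{-1}(i)} \big| l_v^{(j_1)} l_v^{(j_2)} l_v^{(j_3)} l_v^{(j_4)} \big| \qquad (9.4)$$

and the sum of all the terms with $j_1 = j_2 = j_3 = j_4$ is $\|l_v\|_4^4$.

For the other terms, where $j_1, j_2, j_3, j_4$ are not all equal, there is at least
one $|l_v^{(j_r)}|$ ($r \in [4]$) smaller than $\frac{\|l_v\|_2}{d^4}$. Therefore, $|l_v^{(j_1)} l_v^{(j_2)} l_v^{(j_3)} l_v^{(j_4)}|$ can be bounded by

$$\frac{\|l_v\|_2}{d^4} \Big( |l_v^{(j_1)}|^3 + |l_v^{(j_2)}|^3 + |l_v^{(j_3)}|^3 + |l_v^{(j_4)}|^3 \Big).$$

Overall, expression (9.4) can be bounded by

$$\|l_v\|_4^4 + \frac{\|l_v\|_2}{d^4} \sum_{i \in [N]} \ \sum_{j_1, j_2, j_3, j_4 \in \pi^{-1}(i)} \Big( |l_v^{(j_1)}|^3 + |l_v^{(j_2)}|^3 + |l_v^{(j_3)}|^3 + |l_v^{(j_4)}|^3 \Big)$$

$$\leq \|l_v\|_4^4 + \frac{\|l_v\|_2}{d^4} \cdot 4d^3 \|l_v\|_3^3 \qquad \text{(each term is counted at most } 4d^3 \text{ times)}$$

$$\leq \Big( \tau^2 + \frac{4\tau}{d} \Big) \|l_v\|_2^4 \leq 2\tau \|l_v\|_2^4 \qquad \text{(} l_v \text{ is a } \tau\text{-regular vector).}$$

□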

d 


□ 

Let us fix a $2\tau$-nice hyperedge $e = (v_1, \ldots, v_k)$. As before, let $\mathcal{E}_e$ denote the distribution
on examples restricted to those generated for hyperedge $e$. We will analyze the probability
that the halfspace $h(y)$ agrees with a random example from $\mathcal{E}_e$.

Let $\pi_1, \pi_2, \ldots, \pi_k : [M] \to [N]$ denote the projections associated with the hyperedge $e$.
For the sake of brevity, we shall write $w_i, y_i, l_i$ instead of $w_{v_i}, y_{v_i}, l_{v_i}$. For all $j \in [N]$ and
$i \in [k]$, define

$$y_i^{[j]} = \mathrm{Truncate}(y_i, \pi_i^{-1}(j)).$$

Similarly, define vectors $w_i^{[j]}$, $l_i^{[j]}$ and $s_i^{[j]}$.

Notice that for every example $(y, b)$ in the support of $\mathcal{E}_e$, $y_v = 0$ for every vertex $v \notin e$.
Therefore, on restricting to examples from $\mathcal{E}_e$ we can write:

$$h(y) = \mathrm{sgn}\Big( \sum_{i \in [k]} \langle w_i, y_i \rangle - \theta \Big).$$

Common  Influential  Variables 


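Lemma 9.6.5. (Common Influential Coordinate) Let $h(y)$ be a halfspace such that for all
$i \neq j \in [k]$ we have $\pi_i(C_\tau(w_i)) \cap \pi_j(C_\tau(w_j)) = \emptyset$. Then

$$\Big| \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 0] - \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 1] \Big| \leq O\Big(\frac{1}{k}\Big).$$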


Proof. Fix the following notation:

$$y_i^L = \mathrm{Truncate}(y_i, C_\tau(w_i)), \qquad s = s_1, s_2, \ldots, s_k, \qquad y^L = y_1^L, y_2^L, \ldots, y_k^L, \qquad l = l_1, l_2, \ldots, l_k. \qquad (9.5)$$

We can rewrite the halfspace $h(y)$ as $h(y) = \mathrm{sgn}(\langle s, y^L \rangle + \langle l, y \rangle - \theta)$. Let us first normalize
the weights of $h(y)$ so that $\sum_{i \in [k]} \|l_i\|_2^2 = 1$. Let us condition on a possible fixing of the
vector $y^L$. Under this conditioning and also for $b = 0$, define the family of ensembles
$\mathcal{A} = A^{[1]}, \ldots, A^{[N]}$ as follows:

$$A^{[j]} = \{ y_i^{(\ell)} \mid i \in [k], \ell \in [M] \text{ such that } \pi_i(\ell) = j \text{ and } \ell \notin C_\tau(w_i) \}.$$

Similarly define the ensemble $\mathcal{B} = B^{[1]}, \ldots, B^{[N]}$ for the conditioning $b = 1$. Now we shall
apply the invariance principle (Theorem 9.3.10) to the ensembles $\mathcal{A}, \mathcal{B}$ and the linear
function $l(y)$:

$$l(y) = \sum_{j \in [N]} \langle l^{[j]}, y^{[j]} \rangle.$$

MNl 

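As we prove in Claim 9.6.6 below, the ensembles $\mathcal{A}, \mathcal{B}$ have matching moments up to
degree 3. Furthermore, by Lemma 9.3.3, the linear function $l$ and the ensembles $\mathcal{A}, \mathcal{B}$
satisfy the following spread property:

$$\Pr_{\mathcal{A}}\big[ l(\mathcal{A}) \in [\theta' - \alpha, \theta' + \alpha] \big] \leq c(\alpha), \qquad \Pr_{\mathcal{B}}\big[ l(\mathcal{B}) \in [\theta' - \alpha, \theta' + \alpha] \big] \leq c(\alpha)$$

for all $\theta' \in \mathbb{R}$, where $c(\alpha) = 8\alpha k + 4\tau k + 2e^{-\frac{1}{2k^4\tau^2}}$ (by setting $\gamma = \frac{1}{k^2}$ and $|b - a| = 2\alpha$ in Lemma
9.3.3).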
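
The substitution behind $c(\alpha)$ can be checked numerically. The sketch below assumes that the bound of Lemma 9.3.3 has the form $4(b-a)/\sqrt{\gamma} + 4\tau/\sqrt{\gamma} + 2e^{-\gamma^2/(2\tau^2)}$, which is how we read the end of its proof in Section 9.8; the function names `spread_bound` and `c` are ours.

```python
import math

def spread_bound(a, b, tau, gamma):
    """Assumed Lemma 9.3.3 bound: 4(b-a)/sqrt(gamma) + 4*tau/sqrt(gamma) + 2*exp(-gamma^2/(2*tau^2))."""
    root = math.sqrt(gamma)
    return 4 * (b - a) / root + 4 * tau / root + 2 * math.exp(-gamma ** 2 / (2 * tau ** 2))

def c(alpha, k, tau):
    """c(alpha) = 8*alpha*k + 4*tau*k + 2*exp(-1 / (2 * k^4 * tau^2))."""
    return 8 * alpha * k + 4 * tau * k + 2 * math.exp(-1.0 / (2 * k ** 4 * tau ** 2))

# Plugging gamma = 1/k^2 and an interval of length 2*alpha into the bound
# reproduces c(alpha); the numbers below are arbitrary illustrative values.
k, tau, alpha = 3, 0.01, 0.04
assert math.isclose(spread_bound(-alpha, alpha, tau, 1.0 / k ** 2), c(alpha, k, tau), rel_tol=1e-9)
```

With $\gamma = 1/k^2$ we have $1/\sqrt{\gamma} = k$ and $\gamma^2/(2\tau^2) = 1/(2k^4\tau^2)$, which is exactly the term-by-term match the assertion verifies.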
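Using the invariance principle (Theorem 9.3.10), this implies:

$$\Big| \mathbb{E}_{\mathcal{A}}\Big[ \mathrm{sgn}\Big( \langle s, y^L \rangle + \sum_{j \in [N]} \langle l^{[j]}, A^{[j]} \rangle - \theta \Big) \Big] - \mathbb{E}_{\mathcal{B}}\Big[ \mathrm{sgn}\Big( \langle s, y^L \rangle + \sum_{j \in [N]} \langle l^{[j]}, B^{[j]} \rangle - \theta \Big) \Big] \Big| \leq O\Big(\frac{1}{\alpha^4}\Big) \sum_{j \in [N]} \|l^{[j]}\|_1^4 + 2c(\alpha). \qquad (9.6)$$

Take $\alpha$ to be $\frac{1}{k^2}$ and recall that $\tau = \frac{1}{k^{13}}$. In Claim 9.6.7 below we show that

$$\sum_{j \in [N]} \|l^{[j]}\|_1^4 \leq 2\tau k^4.$$

The above inequality holds for an arbitrary conditioning of the values of $y^L$. Hence, by
averaging over all settings of $y^L$ we get that expression (9.6) is bounded by $O(1/k)$. □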


Claim 9.6.6. The ensembles $\mathcal{A}$ and $\mathcal{B}$ have matching moments up to degree 3.

Let us suppose for a moment that $y$ was generated by setting $y_{v_i}^{(j)} = x_i^{(\pi_i(j))}$, that is, without
adding any noise. By Lemma 9.4.1, the first moments of the random variable $y$ conditioned
on $b = 0$ agree with the first moments of the random variable $y$ conditioned on $b = 1$. As we
showed in Observation 9.4.3, even with noise, the first four moments of $y$ remain the same
when conditioned on $b = 0$ and $b = 1$. Finally, $\pi_i(C_\tau(w_i)) \cap \pi_j(C_\tau(w_j)) = \emptyset$ for all $i \neq j \in [k]$.
Hence for each $j \in [N]$, conditioning on $y^L$ fixes bits in at most one row of $A^{[j]}$. Formally,
for every $j \in [N]$, there exists at most one $i \in [k]$ such that $y_i^{[j]}$ and $y^L$ have shared
variables. Therefore, by Lemma 9.4.4, $\mathcal{A}$ and $\mathcal{B}$ have matching moments up to degree 3.
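Claim 9.6.7.

$$\sum_{j \in [N]} \|l^{[j]}\|_1^4 \leq 2\tau k^4.$$

Proof. Since $\|l^{[j]}\|_1 = \sum_{i \in [k]} \|l_i^{[j]}\|_1$, we can write

$$\sum_{j \in [N]} \|l^{[j]}\|_1^4 \leq \sum_{j \in [N]} k^4 \Big( \sum_{i \in [k]} \|l_i^{[j]}\|_1^4 \Big) \leq k^4 \sum_{i \in [k]} \Big( \sum_{j \in [N]} \|l_i^{[j]}\|_1^4 \Big). \qquad (9.7)$$

As $e = (v_1, \ldots, v_k)$ is a $2\tau$-nice hyperedge, we have $\sum_{j \in [N]} \|l_i^{[j]}\|_1^4 \leq 2\tau \|l_i\|_2^4$. By normalization
of $l$, we know $\sum_{i \in [k]} \|l_i\|_2^2 = 1$. Substituting this into inequality (9.7) we get the claimed
bound. □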


Bounding  the  Number  of  Influential  Coordinates 

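Lemma 9.6.8. Given a halfspace $h(y) = \mathrm{sgn}\big( \sum_{i \in [k]} \langle w_i, y_i \rangle - \theta \big)$ and $\ell \in [k]$ such that $|C_\tau(w_\ell)| \geq t$
for $t = \frac{1}{\tau^2}\big( 16k^2 \log(1/k) \log(1/\tau) + \ln(1/\tau) + 10 \ln d \big) = O(k^{30})$, define $\tilde{h}(y) = \mathrm{sgn}\big( \sum_{i \in [k]} \langle \tilde{w}_i, y_i \rangle - \tilde{\theta} \big)$
as follows:

• $\tilde{w}_\ell = \mathrm{Truncate}(w_\ell, S_t(w_\ell))$ and $\tilde{w}_i = w_i$ for all $i \neq \ell$.

• $\tilde{\theta} = \theta - \mathbb{E}[\langle a_\ell, y_\ell \rangle \mid b = 0]$, for $a = w - \tilde{w}$.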
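Then,

$$\Big| \mathbb{E}_{\mathcal{E}_e}[\tilde{h}(y) \mid b = 0] - \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 0] \Big| \leq \frac{1}{k^2}, \qquad \Big| \mathbb{E}_{\mathcal{E}_e}[\tilde{h}(y) \mid b = 1] - \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 1] \Big| \leq \frac{1}{k^2}.$$

Proof. It is easy to see that the matching moments condition implies that

$$\mathbb{E}_{\mathcal{E}_e}[\langle a_\ell, y_\ell \rangle \mid b = 0] = \mathbb{E}_{\mathcal{E}_e}[\langle a_\ell, y_\ell \rangle \mid b = 1].$$

Let us show the inequality for the case $b = 0$; the other inequality can be derived in an
identical way. Let $\mathcal{E}_{e,0}$ denote the distribution $\mathcal{E}_e$ conditioned on $b = 0$. Without loss of
generality, we may assume that $\ell = 1$ and $|w_1^{(1)}| \geq |w_1^{(2)}| \geq \cdots \geq |w_1^{(M)}|$. In particular, this implies
$S_t(w_1) = \{1, \ldots, t\}$. Define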

he  =  E  hf  =  E^0[(aJ},y*l}>]. 

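Let us set $T = \lceil 4k^2 \log(1/k) \rceil$ and define the subset $G = \{g_1, \ldots, g_T\}$ of $S_t(w_1)$ as follows:

$$G = \{ g_i \mid g_i = 1 + i \lceil (4/\tau^2) \ln(1/\tau) \rceil, \ 0 \leq i \leq T \}.$$

Therefore, by Lemma 9.3.2, $|w_1^{(g_i)}|$ is a geometrically decreasing sequence such that $|w_1^{(g_{i+1})}| \leq |w_1^{(g_i)}|/3$. Let $H = S_t(w_1) \setminus G$. Fix the following notation:

$$w_1^G = \mathrm{Truncate}(w_1, G), \qquad w_1^H = \mathrm{Truncate}(w_1, H), \qquad w_1^{>t} = \mathrm{Truncate}(w_1, \{t+1, \ldots, n\}).$$

Similarly, define the vectors $y_1^G, y_1^H, y_1^{>t}$. By definition, we have $a_1 = w_1^{>t}$. Rewriting the
halfspace functions $h(y), \tilde{h}(y)$:

$$h(y) = \mathrm{sgn}\Big( \sum_{i=2}^{k} \langle w_i, y_i \rangle + \langle w_1^G, y_1^G \rangle + \langle w_1^H, y_1^H \rangle + \langle a_1, y_1 \rangle - \theta \Big),$$

$$\tilde{h}(y) = \mathrm{sgn}\Big( \sum_{i=2}^{k} \langle w_i, y_i \rangle + \langle w_1^G, y_1^G \rangle + \langle w_1^H, y_1^H \rangle + \mu_1 - \theta \Big).$$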


By  Claim  9.6.9  below,  with  probability  at  most  ^  =  j,we  have  Kai,yi)-pil  ^  d4l|aill2- 
Suppose  Kai,yi)-jUil  <  c£4 II «i || 2,  then  Claim  9.6.10  below  gives  l(ai,yi)-pil  <  l/d6\w(fT^\  < 
Thus,  we  can  write 


Pr 


&e,0 


Wy)^h(y)] 


^  Pr, 


£e,0 


(iv(f,yf)£[6,--\w(l 


igr) 


where $\theta' = -\sum_{i=2}^{k} \langle w_i, y_i \rangle - \langle w_1^H, y_1^H \rangle - \mu_1 + \theta$. Any fixing of the value of $\theta' \in \mathbb{R}$ induces a
certain distribution on $y_1^G$. However, the $\frac{1}{k^2}$ noise introduced in $y_1^G$ is completely independent.
This corresponds to the setting of Lemma 9.3.7, and hence we can bound the above
probability by $(1 - 1/(2k^2))^T + 1/4^k \leq (1 - 1/(2k^2))^{4k^2 \log(1/k)} + 1/4^k \leq 1/k^2$. □


Claim 9.6.9.

$$\Pr_{\mathcal{E}_{e,0}}\big[ |\langle a_1, y_1 \rangle - \mu_1| \geq d^4 \|a_1\|_2 \big] \leq \frac{1}{d}.$$
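Proof. We claim:

$$\mathrm{Var}_{\mathcal{E}_{e,0}}(\langle a_1, y_1 \rangle) \leq td\|a_1\|_2^2 + d\|a_1\|_2^2 \leq 2td\|a_1\|_2^2.$$

Notice that $[M]$ can be seen as the union of disjoint sets $R_1 \cup R_2 \cup \cdots \cup R_N$, where $R_i = \pi_1^{-1}(i)$.
There are at most $t$ sets such that $R_i \cap S_t(w_1) \neq \emptyset$ and, by the property of our k-LABEL-COVER
instance, there are at most $td$ indices in those sets. Let $U_1 = \bigcup_{R_i \cap S_t(w_1) \neq \emptyset} R_i$ and
let $U_2 = [M] \setminus U_1$. It is easy to see that $y_1^{U_1}$ is independent of $y_1^{U_2}$ and therefore

$$\mathrm{Var}_{\mathcal{E}_{e,0}}(\langle a_1, y_1 \rangle - \mu_1) = \mathrm{Var}_{\mathcal{E}_{e,0}}(\langle a_1^{U_1}, y_1^{U_1} \rangle) + \mathrm{Var}_{\mathcal{E}_{e,0}}(\langle a_1^{U_2}, y_1^{U_2} \rangle).$$

The variance of $\langle a_1^{U_1}, y_1^{U_1} \rangle$ is at most

$$\|a_1^{U_1}\|_1^2 \leq td\|a_1^{U_1}\|_2^2 \leq td\|a_1\|_2^2.$$

Notice that $U_2$ is the union of all the $R_i$'s that do not intersect $S_t(w_1)$. Further, for
any $i \neq j \in [N]$ such that $R_i \cap S_t(w_1) = \emptyset$ and $R_j \cap S_t(w_1) = \emptyset$, $y_1^{R_i}$ is independent of $y_1^{R_j}$. Also,
since every $R_i$ has size at most $d$,

$$\mathrm{Var}_{\mathcal{E}_{e,0}}(\langle a_1^{U_2}, y_1^{U_2} \rangle) = \sum_{R_i \cap S_t(w_1) = \emptyset} \mathrm{Var}_{\mathcal{E}_{e,0}}(\langle a_1^{R_i}, y_1^{R_i} \rangle) \leq \sum_{R_i \cap S_t(w_1) = \emptyset} d\|a_1^{R_i}\|_2^2 = d\|a_1^{U_2}\|_2^2 \leq d\|a_1\|_2^2.$$

Overall, we have

$$\mathrm{Var}_{\mathcal{E}_{e,0}}(\langle a_1, y_1 \rangle) \leq td\|a_1\|_2^2 + d\|a_1\|_2^2 \leq 2td\|a_1\|_2^2.$$

Notice that $t = \mathrm{poly}(\log d)$, and by applying Chebyshev's inequality (Theorem 9.7.3), we have

$$\Pr_{\mathcal{E}_{e,0}}\big[ |\langle a_1, y_1 \rangle - \mu_1| \geq d^4 \|a_1\|_2 \big] \leq \frac{1}{d}.$$

□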

□ 


Claim 9.6.10. By the choice of the parameters $T$ and $t$,

$$\|a_1\|_2 \leq \frac{1}{d^{10}} |w_1^{(g_T)}|.$$


Proof.  By  Lemma  9.3.2, 

■  .(St)\2  -  T 


\w'{*T  > 


- - 1| cr  1 1| S  ^ ^ - II01II9  ^ d10||ai||o. 

(l-T2)*-^)  2  (1_T2)^(ln(l/T)+101n(i)M  1112  ^  1112 


□ 
Soundness Theorem


Recall that we chose $\tau = 1/k^{13}$ and $t = O(k^{30})$.

Lemma 9.6.11. Fix a hyperedge $e$ which is $2\tau$-nice. If for all $i \neq j \in [k]$, $\pi_i(S_t(w_i)) \cap \pi_j(S_t(w_j)) = \emptyset$,
then the probability that the halfspace $h(y)$ agrees with a random example
from $\mathcal{E}_e$ is at most $\frac{1}{2} + O(\frac{1}{k})$.


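Proof. The proof is similar to the proof of Theorem 9.4.8. Define $K = \{ \ell \mid C_\tau(w_\ell) > t \}$. We
divide the problem into the following two cases.

1. $K = \emptyset$; i.e., for all $i \in [k]$, $C_\tau(w_i) \leq t$. Then for any $i \neq j \in [k]$, $S_t(w_i) \cap S_t(w_j) = \emptyset$
implies $C_\tau(w_i) \cap C_\tau(w_j) = \emptyset$. By Lemma 9.6.5, we have

$$\Big| \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 0] - \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 1] \Big| \leq O\Big(\frac{1}{k}\Big).$$

2. $K \neq \emptyset$. Then for all $\ell \in K$, we set $\tilde{w}_\ell = \mathrm{Truncate}(w_\ell, S_t(w_\ell))$ and define a new halfspace
$h'$ by replacing $w_\ell$ with $\tilde{w}_\ell$ in $h$. Since such replacements occur at most $k$ times
and, by Lemma 9.6.8, every replacement changes the output of the halfspace on at
most a $\frac{1}{k^2}$ fraction of examples from $\mathcal{E}_e$, we can bound the overall change by $k \times \frac{1}{k^2} = \frac{1}{k}$.
That is,

$$\Big| \mathbb{E}_{\mathcal{E}_{e,0}}[h'(y)] - \mathbb{E}_{\mathcal{E}_{e,0}}[h(y)] \Big| \leq \frac{1}{k}, \qquad \Big| \mathbb{E}_{\mathcal{E}_{e,1}}[h'(y)] - \mathbb{E}_{\mathcal{E}_{e,1}}[h(y)] \Big| \leq \frac{1}{k}. \qquad (9.8)$$

For the halfspace $h'$ and for all $\ell \in [k]$ we have $|C_\tau(\tilde{w}_\ell)| \leq t$, thus reducing to Case
1. Therefore

$$\Big| \mathbb{E}_{\mathcal{E}_{e,0}}[h'(y)] - \mathbb{E}_{\mathcal{E}_{e,1}}[h'(y)] \Big| \leq O\Big(\frac{1}{k}\Big). \qquad (9.9)$$

Combining (9.8) and (9.9), we get

$$\Big| \mathbb{E}_{\mathcal{E}_{e,0}}[h(y)] - \mathbb{E}_{\mathcal{E}_{e,1}}[h(y)] \Big| \leq O\Big(\frac{1}{k}\Big).$$

In other words, the probability that the halfspace $h(y)$ agrees with a random example from $\mathcal{E}_e$
is at most $\frac{1}{2} + O(\frac{1}{k})$. □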


We  first  recall  the  soundness  statement: 

Proposition 9.6.12. If $\mathcal{L}$ is not a $2k^2 2^{-\gamma k}$-weakly satisfiable instance of smooth k-LABEL-COVER,
then there is no halfspace that agrees with a random example from $\mathcal{E}$ with probability
more than $\frac{1}{2} + \frac{1}{\sqrt{k}}$.

Proof. The proof is by contradiction. We can define the following labeling strategy: for
each vertex $v$, uniformly randomly pick a label from $S_t(w_v)$. We know the size of $S_t(w_v)$ is
$t = O(k^{30})$.

Suppose there exists a halfspace that agrees with a random example from $\mathcal{E}$ with probability
more than $\frac{1}{2} + \frac{1}{\sqrt{k}}$. Then by an averaging argument, for at least a $\frac{1}{2\sqrt{k}}$-fraction of the
hyperedges $e$, $h(y)$ agrees with a random example from $\mathcal{E}_e$ with probability at least $\frac{1}{2} + \frac{1}{2\sqrt{k}}$.
We refer to these edges as good.

Since at most an $O(1/k)$-fraction of the hyperedges are not $2\tau$-nice, we know
that at least a $\frac{1}{4\sqrt{k}}$-fraction of the hyperedges are $2\tau$-nice and good. By Lemma 9.6.11, for
each $2\tau$-nice and good hyperedge $e$ there exist two vertices $v_i, v_j \in e$ such that $\pi^{e,v_i}(S_t(w_i))$
and $\pi^{e,v_j}(S_t(w_j))$ intersect. Then there is a $\frac{1}{t^2}$ probability that the labeling strategy we
defined will weakly satisfy hyperedge $e$.

Overall, this strategy is expected to weakly satisfy at least a $\frac{1}{4\sqrt{k}} \cdot \frac{1}{t^2} = \Omega(\frac{1}{k^{61}})$ fraction of
the hyperedges. This is a contradiction since $\mathcal{L}$ is not $2k^2 2^{-\gamma k}$-weakly satisfiable. □


9.7  Probabilistic  Inequalities 

In  the  discussion  below  we  will  make  use  of  the  following  well-known  inequalities. 
Theorem 9.7.1. (Hoeffding's Inequality) Let $x_1, \ldots, x_n$ be independent real random variables
such that $x_i \in [a_i, b_i]$. Then the sum of these variables $S = \sum_{i=1}^{n} x_i$ satisfies

$$\Pr\big[ |S - \mathbb{E}[S]| \geq nt \big] \leq 2 \exp\Big( -\frac{2n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \Big).$$

Theorem 9.7.2. (Berry-Esseen Theorem) Let $x_1, x_2, \ldots, x_n$ be i.i.d. random unbiased $\{-1, 1\}$
variables. Also assume that $\sum_{i=1}^{n} c_i^2 = 1$ and $\max_i \{|c_i|\} \leq \alpha$. Let $g$ denote a unit Gaussian
variable $N(0, 1)$. Then for any $t \in \mathbb{R}$,

$$\Big| \Pr\Big[ \sum_i c_i x_i \leq t \Big] - \Pr[g \leq t] \Big| \leq \alpha.$$
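
For intuition, the bound can be checked exactly in the simplest case of equal weights $c_i = 1/\sqrt{n}$, where $\alpha = 1/\sqrt{n}$ and the sum is a rescaled binomial. The Python sketch below is only an illustration of that special case; the function names are ours.

```python
import math

def gaussian_cdf(t):
    """CDF of a unit Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def max_cdf_gap(n):
    """Largest gap between the CDF of sum_i x_i / sqrt(n) and the Gaussian CDF.

    The sum takes the values (2m - n) / sqrt(n) for m = 0..n with binomial
    weights. The discrete CDF is constant between jumps and the Gaussian CDF
    is monotone, so the gap is maximised at a jump, approached from the left
    or evaluated at the jump itself.
    """
    gap, cdf = 0.0, 0.0
    for m in range(n + 1):
        t = (2 * m - n) / math.sqrt(n)
        phi = gaussian_cdf(t)
        gap = max(gap, abs(cdf - phi))     # just left of the jump
        cdf += math.comb(n, m) / 2.0 ** n
        gap = max(gap, abs(cdf - phi))     # at the jump
    return gap

# For n = 100 the gap stays below alpha = 1/sqrt(100) = 0.1.
assert max_cdf_gap(100) <= 1.0 / math.sqrt(100)
```

The gap is dominated by the jump of the binomial distribution at its centre, which is of order $1/\sqrt{n}$, so the dependence on $\alpha$ in the theorem cannot be improved in general.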

Theorem 9.7.3. (Chebyshev's Inequality) Let $X$ be a random variable with expected value
$\mu$ and variance $\sigma^2$. Then for any real number $t > 0$,

$$\Pr[|X - \mu| \geq t\sigma] \leq 1/t^2.$$


9.8  Proof  of  Lemma  9.3.3 

Recall that each $y^{(i)}$ is generated in the following manner:

$$y^{(i)} = \begin{cases} x^{(i)} & \text{with probability } 1 - \gamma \\ \text{random bit} & \text{with probability } \gamma. \end{cases} \qquad (9.10)$$

Let us define a random vector $z \in \{0, 1\}^n$ based on $y$. For $y$ generated, if $y^{(i)}$ is generated
as a copy of $x^{(i)}$ in (9.10), then $z^{(i)} = 0$; if $y^{(i)}$ is generated as a random bit in (9.10), then
$z^{(i)} = 1$. Let us write $S = \sum_{i=1}^{n} w^{(i)} y^{(i)}$. Our proof is based on two claims.
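
The noise process (9.10) and the indicator vector $z$ can be sketched as follows. This is an illustration only: bits are taken to be in $\{0,1\}$, matching the later change of variables $y' = 2y - 1$, and the function name `noisy_copy` is ours.

```python
import random

def noisy_copy(x, gamma, rng):
    """Return (y, z): y copies x except that each coordinate is independently
    replaced by a fresh uniform bit with probability gamma; z marks with a 1
    the coordinates that were rerandomised and with a 0 the copied ones."""
    y, z = [], []
    for bit in x:
        if rng.random() < gamma:
            y.append(rng.choice([0, 1]))   # rerandomised coordinate
            z.append(1)
        else:
            y.append(bit)                  # exact copy of x
            z.append(0)
    return y, z
```

Conditioned on $z$, the coordinates with $z^{(i)} = 1$ are independent uniform bits, which is what the two claims exploit.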

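Claim 9.8.1. $\Pr\big[ \sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} \geq \gamma/2 \big] \geq 1 - 2e^{-\frac{\gamma^2}{2\tau^2}}$.

Claim 9.8.2. For any $a' < b' \in \mathbb{R}$ and any fixing of $z^{(1)}, z^{(2)}, \ldots, z^{(n)}$, if $\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} = \sigma^2 > 0$,
then $\Pr[S \in [a', b']] \leq \frac{2|b' - a'|}{\sigma} + \frac{2\tau}{\sigma}$.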
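Given that the above two claims are correct, define the event $V$ to be $\{ \sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} \geq \frac{\gamma}{2} \}$ and
use $1_{[a,b]}(x) : \mathbb{R} \to \{0, 1\}$ to denote the indicator function of whether $x$ falls into the interval
$[a, b]$. Then

$$\Pr[S \in [a, b]] = \mathbb{E}[1_{[a,b]}(S)] = \Pr[V]\,\mathbb{E}[1_{[a,b]}(S) \mid V] + \Pr[\neg V]\,\mathbb{E}[1_{[a,b]}(S) \mid \neg V].$$

By Claim 9.8.1,

$$\Pr[\neg V]\,\mathbb{E}[1_{[a,b]}(S) \mid \neg V] \leq \Pr[\neg V] \leq 2e^{-\frac{\gamma^2}{2\tau^2}}.$$

By Claim 9.8.2,

$$\Pr[V]\,\mathbb{E}[1_{[a,b]}(S) \mid V] \leq \frac{4(b - a)}{\sqrt{\gamma}} + \frac{4\tau}{\sqrt{\gamma}}.$$

Overall,

$$\Pr[S \in [a, b]] \leq \frac{4(b - a)}{\sqrt{\gamma}} + \frac{4\tau}{\sqrt{\gamma}} + 2e^{-\frac{\gamma^2}{2\tau^2}}.$$

It remains to verify Claim 9.8.1 and Claim 9.8.2.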

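To prove Claim 9.8.1, we apply Hoeffding's inequality (see Theorem 9.7.1).
Notice that $(w^{(i)})^2 z^{(i)} \in [0, (w^{(i)})^2]$; applying Hoeffding's inequality, we know

$$\Pr\Big[ \Big| \sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} - \mathbb{E}\Big[ \sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} \Big] \Big| \geq nt \Big] \leq 2 \exp\Big( -\frac{2n^2 t^2}{\sum_{i=1}^{n} (w^{(i)})^4} \Big).$$

We know $\mathbb{E}[\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)}] = \gamma$ and $\sum_{i=1}^{n} ((w^{(i)})^2)^2 \leq \max_i \{(w^{(i)})^2\} \sum_{i=1}^{n} (w^{(i)})^2 \leq \tau^2$. If we
take $nt = \gamma/2$, we have

$$\Pr\Big[ \Big| \sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} - \gamma \Big| \geq \frac{\gamma}{2} \Big] \leq 2e^{-\frac{\gamma^2}{2\tau^2}}.$$

Therefore, with probability at least $1 - 2e^{-\frac{\gamma^2}{2\tau^2}}$, $\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} \geq \frac{\gamma}{2}$.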
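
Claim 9.8.1 can be sanity-checked exactly in the special case of equal weights $w^{(i)} = 1/\sqrt{n}$, for which $\tau = 1/\sqrt{n}$ and $\sum_i (w^{(i)})^2 z^{(i)}$ is a binomial count divided by $n$. The Python sketch below covers only this special case, with function names of our own choosing.

```python
import math

def prob_mass_at_least_half_gamma(n, gamma):
    """Exact Pr[ sum_i (w_i)^2 z_i >= gamma/2 ] for w_i = 1/sqrt(n) and
    independent z_i that equal 1 with probability gamma."""
    # The sum equals (number of ones among z) / n, so the event is
    # Binomial(n, gamma) >= gamma * n / 2.
    threshold = math.ceil(gamma * n / 2 - 1e-12)
    return sum(math.comb(n, m) * gamma ** m * (1 - gamma) ** (n - m)
               for m in range(threshold, n + 1))

def claim_lower_bound(gamma, tau):
    """Right-hand side of Claim 9.8.1: 1 - 2*exp(-gamma^2 / (2*tau^2))."""
    return 1 - 2 * math.exp(-gamma ** 2 / (2 * tau ** 2))

n, gamma = 400, 0.25
assert prob_mass_at_least_half_gamma(n, gamma) >= claim_lower_bound(gamma, 1 / math.sqrt(n))
```

The exact probability is much closer to 1 than the bound requires here, as expected since Hoeffding's inequality does not use the variance of the summands.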

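To prove Claim 9.8.2, we use the Berry-Esseen Theorem (see Theorem 9.7.2). Let
us split $S$ into two parts: $S' = \sum_{z^{(i)}=1} w^{(i)} y^{(i)}$ and $S'' = \sum_{z^{(i)}=0} w^{(i)} y^{(i)}$. Since $S = S' + S''$ and $S'$ is
independent of $S''$, it suffices to show that $\Pr[S' \in [a', b']] \leq \frac{2|b' - a'|}{\sigma} + \frac{2\tau}{\sigma}$ for any $a', b' \in \mathbb{R}$.

Define $y'^{(i)} = 2y^{(i)} - 1$ and note that $y'^{(i)}$ is a $\{-1, 1\}$ variable. By rewriting $S'$ using this
definition, we have

$$S' = \sum_{z^{(i)}=1} w^{(i)} y^{(i)} = \sum_{z^{(i)}=1} w^{(i)} \frac{1 + y'^{(i)}}{2}.$$

Then

$$\Pr[S' \in [a', b']] = \Pr\Big[ \sum_{z^{(i)}=1} w^{(i)} y'^{(i)} \in [a'', b''] \Big], \qquad (9.11)$$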
(9.11) 
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where  a"  =  2a'  -  Zz(.i)=iWM  and  6"  =  2b'  -  L2(o=1  a/j).  We  can  further  rewrite  the  above 
term  as 


Pr 

£  w(i)y'w^b" 

-Pr 

Y  u/byW  y  a" 

,z(i>  = 1 

,z(i)  =  l 

Pr 


w(i)y'(i)  0 

Y  —  y  — 

*(<)=i  Jzz^Mi]>>2  Jzzau M^2 


-Pr 


Y  —  <  — 

^(i)=i  Jzz^Mi])2  Jzz^Mi])2 


We  can  now  apply  Berry-Esseen’s  theorem.  Notice  that  for  all  the  i  such  that  =  1, 

y'(l)  is  distributed  as  an  independent  unbiased  random  {-1, 1}  variable.  Also  maxz(,)=1  .  |w  1  =  ^ 

^/l2(i)=1(^(l))2 

T 

By  Berry-Esseen’s  theorem,  we  know  that  expression  (9.11)  is  bounded  by 


Pr 


b" 

N(  0,1K— = 


-Pr 


N(0,1K— = 

Vi  z^Ml))2 


2t 

y/Lzw=1(wM? 


Using  the  fact  that  a  unit  Gaussian  variable  falls  in  any  interval  of  length  A  with  prob¬ 
ability  at  most  A  and  noticing  that  b"  -  a"  =  2(6'  -  a'),  we  can  bound  the  above  quantity 
by 

2|6'-a'|  +  2t  2|6-a|  +  2r 

\/lz<0=1(H>(i))2  a  a 
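The last step relies on the fact that a unit Gaussian puts mass at most λ on any interval of length λ (its density never exceeds 1/√(2π) < 1). The sketch below checks this numerically on a few intervals and evaluates the resulting bound (2|b' - a'| + 2τ)/σ; the helper names and the sample values are illustrative assumptions.

```python
import math

def std_normal_cdf(t):
    """CDF of N(0, 1), computed via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def gaussian_interval_prob(lo, hi):
    """Pr[N(0, 1) in [lo, hi]]."""
    return std_normal_cdf(hi) - std_normal_cdf(lo)

def anticoncentration_bound(a, b, tau, sigma):
    """The bound (2|b - a| + 2 tau) / sigma obtained at the end of the proof."""
    return (2.0 * abs(b - a) + 2.0 * tau) / sigma

# Pairs (probability, interval length) for a few assumed sample intervals.
intervals = [(-0.05, 0.05), (0.0, 0.5), (-1.0, 1.0), (2.0, 2.3)]
checks = [(gaussian_interval_prob(lo, hi), hi - lo) for lo, hi in intervals]
example_bound = anticoncentration_bound(0.0, 0.5, 0.1, 2.0)
```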


9.9  Proof  of  Invariance  Principle  (Theorem  9.3.10) 

We  restate  our  version  of  the  invariance  principle  here  for  convenience. 


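Theorem 9.3.10 restated (Invariance Principle) Let $\mathcal{A} = \{A^{\{1\}}, \ldots, A^{\{R\}}\}$, $\mathcal{B} = \{B^{\{1\}}, \ldots, B^{\{R\}}\}$ be families of ensembles of random variables with $A^{\{i\}} = \{a_1^{(i)}, \ldots, a_{k_i}^{(i)}\}$ and $B^{\{i\}} = \{b_1^{(i)}, \ldots, b_{k_i}^{(i)}\}$, satisfying the following properties:

• For each $i \in [R]$, the random variables in ensembles $(A^{\{i\}}, B^{\{i\}})$ have matching moments up to degree 3. Further, all the random variables in $\mathcal{A}$ and $\mathcal{B}$ are bounded by 1.

• The ensembles $A^{\{i\}}$ are all independent of each other; similarly, the ensembles $B^{\{i\}}$ are independent of each other.

Given a set of vectors $l = \{l^{\{1\}}, \ldots, l^{\{R\}}\}$ with $l^{\{i\}} \in \mathbb{R}^{k_i}$, define the linear function $l : \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_R} \to \mathbb{R}$ as

\[ l(x) = \sum_{i \in [R]} \langle l^{\{i\}}, x^{\{i\}} \rangle. \]

Then for a $B$-nice function $\Psi : \mathbb{R} \to \mathbb{R}$ we have

\[ \left| E_{\mathcal{A}}\left[\Psi(l(\mathcal{A}) - \theta)\right] - E_{\mathcal{B}}\left[\Psi(l(\mathcal{B}) - \theta)\right] \right| \le B \sum_{i \in [R]} \|l^{\{i\}}\|_1^4 \tag{9.12} \]

for all $\theta > 0$. Further, define the spread function $c(\alpha)$ corresponding to the ensembles $\mathcal{A}, \mathcal{B}$ and the linear function $l$ as follows.

(Spread Function) For $1/2 > \alpha > 0$, let

\[ c(\alpha) = \max\left( \sup_{\theta} \Pr_{\mathcal{A}}\left[l(\mathcal{A}) \in [\theta - \alpha, \theta + \alpha]\right],\ \sup_{\theta} \Pr_{\mathcal{B}}\left[l(\mathcal{B}) \in [\theta - \alpha, \theta + \alpha]\right] \right); \]

then for all $\theta$,

\[ \left| E_{\mathcal{A}}\left[\mathrm{sgn}(l(\mathcal{A}) - \theta)\right] - E_{\mathcal{B}}\left[\mathrm{sgn}(l(\mathcal{B}) - \theta)\right] \right| \le O\left(\frac{1}{\alpha^4}\right) \sum_{i \in [R]} \|l^{\{i\}}\|_1^4 + 2c(\alpha). \tag{9.13} \]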


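Proof. Let us prove equation (9.12) first. Let $\mathcal{P}_i = \{B^{\{1\}}, \ldots, B^{\{i\}}, A^{\{i+1\}}, \ldots, A^{\{R\}}\}$, so that $\mathcal{P}_0 = \mathcal{A}$ and $\mathcal{P}_R = \mathcal{B}$. We know that

\[ E_{\mathcal{A}}[\Psi(l(\mathcal{A}) - \theta)] - E_{\mathcal{B}}[\Psi(l(\mathcal{B}) - \theta)] = E_{\mathcal{P}_0}[\Psi(l(\mathcal{P}_0) - \theta)] - E_{\mathcal{P}_R}[\Psi(l(\mathcal{P}_R) - \theta)] = \sum_{i=1}^{R} \left( E_{\mathcal{P}_{i-1}}[\Psi(l(\mathcal{P}_{i-1}) - \theta)] - E_{\mathcal{P}_i}[\Psi(l(\mathcal{P}_i) - \theta)] \right). \]

Therefore, it suffices to prove

\[ \left| E_{\mathcal{P}_{i-1}}[\Psi(l(\mathcal{P}_{i-1}) - \theta)] - E_{\mathcal{P}_i}[\Psi(l(\mathcal{P}_i) - \theta)] \right| \le B \|l^{\{i\}}\|_1^4. \]

Let $\mathcal{Q}_i = \{B^{\{1\}}, \ldots, B^{\{i-1\}}, A^{\{i+1\}}, \ldots, A^{\{R\}}\}$, so that $\mathcal{P}_i = \{\mathcal{Q}_i, B^{\{i\}}\}$ and $\mathcal{P}_{i-1} = \{\mathcal{Q}_i, A^{\{i\}}\}$. Then

\[ E_{\mathcal{P}_{i-1}}[\Psi(l(\mathcal{P}_{i-1}) - \theta)] - E_{\mathcal{P}_i}[\Psi(l(\mathcal{P}_i) - \theta)] = E_{\mathcal{Q}_i}\left[ E_{A^{\{i\}}}[\Psi(l(\mathcal{P}_{i-1}) - \theta)] - E_{B^{\{i\}}}[\Psi(l(\mathcal{P}_i) - \theta)] \right]. \tag{9.14} \]

Notice that

\[ l(\mathcal{P}_{i-1}) - \theta = \langle l^{\{i\}}, A^{\{i\}} \rangle + \sum_{1 \le j \le i-1} \langle l^{\{j\}}, B^{\{j\}} \rangle + \sum_{i+1 \le j \le R} \langle l^{\{j\}}, A^{\{j\}} \rangle - \theta \]

and

\[ l(\mathcal{P}_i) - \theta = \langle l^{\{i\}}, B^{\{i\}} \rangle + \sum_{1 \le j \le i-1} \langle l^{\{j\}}, B^{\{j\}} \rangle + \sum_{i+1 \le j \le R} \langle l^{\{j\}}, A^{\{j\}} \rangle - \theta. \]

Take $\theta' = \sum_{1 \le j \le i-1} \langle l^{\{j\}}, B^{\{j\}} \rangle + \sum_{i+1 \le j \le R} \langle l^{\{j\}}, A^{\{j\}} \rangle - \theta$. We can further rewrite equation (9.14) as

\[ E_{\mathcal{Q}_i}\left[ E_{A^{\{i\}}}[\Psi(\langle l^{\{i\}}, A^{\{i\}} \rangle + \theta')] - E_{B^{\{i\}}}[\Psi(\langle l^{\{i\}}, B^{\{i\}} \rangle + \theta')] \right]. \tag{9.15} \]

Using the Taylor expansion of $\Psi$, we have that the inner expectation of equation (9.15) is equal to

\[ E_{A^{\{i\}}}\left[ \Psi(\theta') + \Psi'(\theta') \langle l^{\{i\}}, A^{\{i\}} \rangle + \frac{\Psi''(\theta')}{2} \langle l^{\{i\}}, A^{\{i\}} \rangle^2 + \frac{\Psi'''(\theta')}{6} \langle l^{\{i\}}, A^{\{i\}} \rangle^3 + \frac{\Psi''''(\delta_1)}{24} \langle l^{\{i\}}, A^{\{i\}} \rangle^4 \right] \]
\[ {} - E_{B^{\{i\}}}\left[ \Psi(\theta') + \Psi'(\theta') \langle l^{\{i\}}, B^{\{i\}} \rangle + \frac{\Psi''(\theta')}{2} \langle l^{\{i\}}, B^{\{i\}} \rangle^2 + \frac{\Psi'''(\theta')}{6} \langle l^{\{i\}}, B^{\{i\}} \rangle^3 + \frac{\Psi''''(\delta_2)}{24} \langle l^{\{i\}}, B^{\{i\}} \rangle^4 \right]. \tag{9.16} \]

Using the fact that $A^{\{i\}}$ and $B^{\{i\}}$ have matching moments up to degree 3, we can upper bound the absolute value of equation (9.16) by

\[ \left| E_{A^{\{i\}}}\left[ \frac{\Psi''''(\delta_1)}{24} \langle l^{\{i\}}, A^{\{i\}} \rangle^4 \right] - E_{B^{\{i\}}}\left[ \frac{\Psi''''(\delta_2)}{24} \langle l^{\{i\}}, B^{\{i\}} \rangle^4 \right] \right| \le \frac{B}{12} \|l^{\{i\}}\|_1^4. \]

In the last inequality, we use the fact that $\Psi$ is $B$-nice and that $|\langle l^{\{i\}}, A^{\{i\}} \rangle| \le \|l^{\{i\}}\|_1$, $|\langle l^{\{i\}}, B^{\{i\}} \rangle| \le \|l^{\{i\}}\|_1$, since all the random variables are bounded by 1.

Overall, we bound the inner expectation of equation (9.15) by $\frac{B}{12} \|l^{\{i\}}\|_1^4$. This implies that equation (9.15), and therefore equation (9.14), is bounded by $\frac{B}{12} \|l^{\{i\}}\|_1^4$, establishing equation (9.12).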

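To prove equation (9.13), we need to use the following lemma.

Lemma 9.9.1. ([117], Lemma 3.21) There exists some constant $C$ such that for all $0 < \lambda < 1/2$, there exists a $\frac{C}{\lambda^4}$-nice function $\Phi_\lambda : \mathbb{R} \to [0, 1]$ which approximates the $\mathrm{sgn}(x)$ function in the following sense: $\Phi_\lambda(t) = 1$ for all $t > \lambda$; $\Phi_\lambda(t) = 0$ for $t < -\lambda$.

By the above lemma, we can find a $\frac{C}{\alpha^4}$-nice function $\Phi_\alpha$ such that $\Phi_\alpha(l(\mathcal{A}) - \theta)$ is equal to $\mathrm{sgn}(l(\mathcal{A}) - \theta)$ except when $l(\mathcal{A}) \in [\theta - \alpha, \theta + \alpha]$, and $\Phi_\alpha(l(\mathcal{B}) - \theta)$ is equal to $\mathrm{sgn}(l(\mathcal{B}) - \theta)$ except when $l(\mathcal{B}) \in [\theta - \alpha, \theta + \alpha]$. Also, for any $x \in \mathbb{R}$, $|\mathrm{sgn}(x) - \Phi_\alpha(x)| \le 1$, as $\mathrm{sgn}(x)$ and $\Phi_\alpha(x)$ are both in $[0, 1]$.

Overall, by the triangle inequality and equation (9.12), we have

\[ \left| E_{\mathcal{A}}[\mathrm{sgn}(l(\mathcal{A}) - \theta)] - E_{\mathcal{B}}[\mathrm{sgn}(l(\mathcal{B}) - \theta)] \right| \le \left| E_{\mathcal{A}}[\mathrm{sgn}(l(\mathcal{A}) - \theta)] - E_{\mathcal{A}}[\Phi_\alpha(l(\mathcal{A}) - \theta)] \right| \]
\[ {} + \left| E_{\mathcal{A}}[\Phi_\alpha(l(\mathcal{A}) - \theta)] - E_{\mathcal{B}}[\Phi_\alpha(l(\mathcal{B}) - \theta)] \right| + \left| E_{\mathcal{B}}[\Phi_\alpha(l(\mathcal{B}) - \theta)] - E_{\mathcal{B}}[\mathrm{sgn}(l(\mathcal{B}) - \theta)] \right| \]
\[ {} \le \frac{C}{\alpha^4} \sum_{i \in [R]} \|l^{\{i\}}\|_1^4 + 2c(\alpha). \]

□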


9.10 Hardness of Smooth k-Label-Cover

First we state the bipartite smooth Label Cover given by Khot [95]. Our reduction is similar to the one in [61] but in addition requires proving the smoothness property.

Definition 9.10.1. A Label Cover problem $\mathcal{L}(G(V,W,E), N, M, \{\pi^{w,v} \mid (w,v) \in E\})$ consists of a bipartite graph $G(V,W,E)$ with bipartition $V$ and $W$, and projection functions $\pi^{w,v} : [M] \to [N]$ associated with each edge $(w,v) \in E$. We will only consider instances where all vertices in $W$ have the same degree. For any labeling $L : V \to [M]$ and $L : W \to [N]$, an edge is said to be satisfied if $\pi^{w,v}(L(v)) = L(w)$. We define $\mathrm{Opt}(\mathcal{L})$ to be the maximum fraction of edges satisfied by any labeling.

Theorem 9.10.2. There is a constant $\gamma > 0$ such that for all integer parameters $u$ and $J$, it is NP-hard to distinguish the following two cases for a Label Cover problem $\mathcal{L}(G(V,W,E), N, M, \{\pi^{w,v} \mid (w,v) \in E\})$ with $M = 7^{(J+1)u}$ and $N = 2^u 7^{Ju}$:

• $\mathrm{Opt}(\mathcal{L}) = 1$, or

• $\mathrm{Opt}(\mathcal{L}) \le 2^{-2\gamma u}$.

In addition, the Label Cover has the following properties:

• for each $\pi^{w,v}$ and any $i \in [N]$, we have $|(\pi^{w,v})^{-1}(i)| \le 4^u$;

• for a fixed vertex $w$ and a randomly picked neighbor $v$ of $w$,

\[ \forall i \ne j \in [M],\quad \Pr[\pi^{w,v}(i) = \pi^{w,v}(j)] \le 1/J. \]

Now  we  are  ready  to  prove  Theorem  9.6.1. 

Proof. Given an instance of bipartite Label Cover $\mathcal{L}(G(V,W,E), N, M, \{\pi^{w,v} \mid (w,v) \in E\})$, we can convert it to a smooth k-Label-Cover instance $\mathcal{L}'$ as follows. The vertex set of $\mathcal{L}'$ is $V$, and we generate the hyperedge set $E'$ and the projections associated with the hyperedges in the following way:

1. pick a vertex $w \in W$;

2. pick all $k$-tuples of $w$'s neighbors $v_1, \ldots, v_k$ and add each of them as a hyperedge $e$ to $E'$;

3. for each $v_i \in e$, define $\pi^{e,v_i} = \pi^{w,v_i}$.
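The three steps above can be sketched as follows. The data layout (dictionaries keyed by vertex names, projections stored as tuples), the choice of unordered k-subsets of distinct neighbors as the "k-tuples", and the reading of "strongly satisfied" as "all k projected labels agree" are assumptions made for illustration.

```python
from itertools import combinations

def build_hyperedges(neighbors, projections, k):
    """neighbors: dict w -> list of neighbors v in V.
    projections: dict (w, v) -> tuple pi with pi[a] in [N] for each label a in [M].
    Returns a list of hyperedges (vertices, {v: projection inherited from w})."""
    hyperedges = []
    for w, nbrs in neighbors.items():                 # step 1: each w in W
        for tup in combinations(sorted(nbrs), k):     # step 2: k of w's neighbors
            proj = {v: projections[(w, v)] for v in tup}   # step 3: reuse pi^{w,v}
            hyperedges.append((tup, proj))
    return hyperedges

def strongly_satisfied(hyperedge, labeling):
    """True if all vertices of the hyperedge project to the same label."""
    vertices, proj = hyperedge
    return len({proj[v][labeling[v]] for v in vertices}) == 1

# Assumed toy instance with M = 4, N = 2 and a single vertex w1 of degree 3.
pi = (0, 0, 1, 1)
neighbors = {"w1": ["v1", "v2", "v3"]}
projections = {("w1", v): pi for v in neighbors["w1"]}
hyperedges = build_hyperedges(neighbors, projections, k=2)
good = {"v1": 0, "v2": 1, "v3": 0}    # every label projects to 0
bad = {"v1": 0, "v2": 2, "v3": 0}     # v2 projects to 1
```

With the labeling `good`, all three hyperedges are strongly satisfied, mirroring the completeness argument: a labeling consistent with a single label of w satisfies every hyperedge generated from w.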


Completeness: If $\mathrm{Opt}(\mathcal{L}) = 1$, then there exists a labeling $L$ such that for every edge $(w,v) \in E$, $\pi^{w,v}(L(v)) = L(w)$. We can simply take the restriction of the labeling $L$ to $V$ for the smooth k-Label-Cover instance $\mathcal{L}'$. For any hyperedge $e = (v_1, v_2, \ldots, v_k)$ generated by $w \in W$, we know $\pi^{e,v_i}(L(v_i)) = L(w) = \pi^{e,v_j}(L(v_j))$ for any $i, j \in [k]$. Therefore, we know that there exists a labeling strongly satisfying all hyperedges in $\mathcal{L}'$.


Soundness: If $\mathrm{Opt}(\mathcal{L}) \le 2^{-2\gamma u}$, then we can weakly satisfy at most a $2k^2 2^{-\gamma u}$ fraction of the hyperedges in $\mathcal{L}'$. This can be proved via a contrapositive argument. Suppose there is a labeling $L$ (defined on $V$) for the smooth k-Label-Cover that weakly satisfies an $\alpha \ge 2k^2 2^{-\gamma u}$ fraction of the hyperedges. Using the regularity of the graph in $\mathcal{L}'$, we know that if we randomly pick a vertex $w$ and randomly pick two of its neighbors $v_1, v_2$, then

\[ \Pr[\pi^{w,v_1}(L(v_1)) = \pi^{w,v_2}(L(v_2))] \ge \frac{\alpha}{\binom{k}{2}} \ge \frac{2\alpha}{k^2}. \]

By an averaging argument, at least an $\frac{\alpha}{k^2}$ fraction of the vertices $w \in W$ have the following property: among all the possible pairs of $w$'s neighbors, at least an $\frac{\alpha}{k^2}$ fraction have the same projected labels under $L$. For every $w$ with this property, by an averaging argument again, one of $w$'s neighbors, say $v_0$, must have the same projected label as at least an $\frac{\alpha}{k^2}$ fraction of $w$'s other neighbors. We can simply assign $w$ the label $\pi^{w,v_0}(L(v_0))$. Using such a labeling strategy (only on vertices with the above property), we will satisfy at least an $\frac{\alpha^2}{k^4} \ge 4 \cdot 2^{-2\gamma u}$ fraction of the edges of $\mathcal{L}$, leading to a contradiction.


Smoothness of $\mathcal{L}'$: For any given vertex $v$ in $V$, we want to show that if we randomly pick a hyperedge $e'$ containing $v$, then for the projection $\pi^{e',v}$ as defined in $\mathcal{L}'$,

\[ \forall i \ne j \in [M],\quad \Pr[\pi^{e',v}(i) = \pi^{e',v}(j)] \le \frac{1}{J}. \]

To see this, notice that all vertices in $W$ have the same degree; picking a projection $\pi^{e',v}$ using the above procedure is the same as randomly picking a neighbor $w$ of $v$ and using the projection $\pi^{w,v}$ defined in $\mathcal{L}$. Therefore,

\[ \forall i \ne j \in [M],\quad \Pr[\pi^{e',v}(i) = \pi^{e',v}(j)] = \Pr[\pi^{w,v}(i) = \pi^{w,v}(j)] \le \frac{1}{J}. \]

□

Chapter 10

Hardness of Learning Low Degree PTFs

10.1 Introduction


10.1.1  Motivation 

The  last  few  years  have  witnessed  a  surge  of  research  interest  and  results  in  theoretical 
computer  science  on  halfspaces  and  low-degree  PTFs,  see  e.g.  [42,  54,  55,  70,  85,  124, 132]. 
One  reason  for  this  interest  is  the  central  role  played  by  low-degree  PTFs  (and  halfspaces 
in  particular)  in  both  practical  and  theoretical  aspects  of  machine  learning,  where  many 
learning  algorithms  either  implicitly  or  explicitly  use  low-degree  PTFs  as  their  hypothe¬ 
ses.  More  specifically,  several  widely  used  linear  separator  learning  algorithms  such  as  the 
Perceptron  algorithm  and  the  “maximum  margin”  algorithm  at  the  heart  of  Support  Vector 
Machines  output  halfspaces  as  their  hypotheses.  These  and  other  halfspace-based  learn¬ 
ing  methods  are  commonly  augmented  in  practice  with  the  “kernel  trick,”  which  makes 
it  possible  to  efficiently  run  these  algorithms  over  an  expanded  feature  space  and  thus 
potentially learn from labeled data that is not linearly separable in $\mathbb{R}^n$. The “polynomial
kernel”  is  a  popular  kernel  to  use  in  this  way;  when,  as  is  usually  the  case,  the  degree 
parameter  in  the  polynomial  kernel  is  set  to  be  a  small  constant,  these  algorithms  out¬ 
put  hypotheses  that  are  equivalent  to  low-degree  PTFs.  Low-degree  PTFs  are  also  used 
as  hypotheses  in  several  important  learning  algorithms  with  a  more  complexity-theoretic 
flavor, such as the low-degree algorithm of Linial et al. [111] and its variants [81, 118],
including  some  algorithms  for  distribution-specific  agnostic  learning  [21,  42,  84,  108]. 

Given  the  importance  of  learning  algorithms  that  construct  low-degree  PTF  hypothe¬ 
ses,  it  is  a  natural  goal  to  study  the  limitations  of  learning  algorithms  that  work  in  this 
way. We study the problem of learning low-degree PTFs under the agnostic learning model, or equivalently the PTF_d-MA problem.

10.1.2  Our  Main  results 

Recall  the  definition  of  PTFs  as  follows: 

Definition 10.1.1. For a positive integer $d$, we call a function $f(x) : \mathbb{R}^n \to \mathbb{R}$ a degree $d$ polynomial function if it has the following polynomial expansion form:

\[ f(x) = \sum_{\text{multiset } S \subseteq [n],\ |S| \le d} c_S \prod_{i \in S} x_i. \]

A degree $d$ polynomial threshold function is of the form $\mathrm{sgn}(f(x))$ where $f(x)$ is a degree $d$ polynomial function.
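To make Definition 10.1.1 concrete, here is a minimal sketch that stores a polynomial as a map from multisets S (sorted tuples of 0-based indices, with repeats) to coefficients c_S and evaluates the corresponding PTF. The representation, the function names, and the convention sgn(0) = +1 are assumptions for illustration only.

```python
from math import prod

def eval_poly(coeffs, x):
    """Evaluate f(x) = sum_S c_S * prod_{i in S} x_i; the empty tuple is the constant term."""
    return sum(c * prod(x[i] for i in S) for S, c in coeffs.items())

def sgn(t):
    """Sign function with the (assumed) convention sgn(0) = +1."""
    return 1 if t >= 0 else -1

def ptf(coeffs, x):
    """The polynomial threshold function sgn(f(x))."""
    return sgn(eval_poly(coeffs, x))

def degree(coeffs):
    """Largest multiset size carrying a nonzero coefficient."""
    return max((len(S) for S, c in coeffs.items() if c != 0), default=0)

# f(x) = 1 - 2*x_0 + 3*x_0*x_1^2, a degree 3 polynomial on R^2.
example = {(): 1.0, (0,): -2.0, (0, 1, 1): 3.0}
```

For instance, at x = (1, 2) the polynomial evaluates to 1 - 2 + 12 = 11, so the PTF outputs +1; at x = (1, 0) it evaluates to -1, so the PTF outputs -1.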

Our main results are the following two theorems. Our first result is obtained assuming the UGC.

Theorem 10.1.2. Assuming the UGC, for any constant $d$, PTF_d-MA$(1-\epsilon, 1/2+\epsilon)$ is NP-hard for any constant $\epsilon > 0$.

Remark 10.1.3. In fact, our hardness results also hold for $d$ being $o(\log\log n)$.

Theorem 10.1.4. PTF_1-PTF_2-MA$(1-\epsilon, 1/2+\epsilon)$ is NP-hard.

Note that the parameters in these hardness results are essentially optimal, since it is trivial to find a hypothesis with agreement rate $\frac{1}{2}$: we can randomly choose to output the function that is always $-1$ or the function that is always $1$.

Our results immediately imply the following hardness of agnostic learning results: i) Assuming the UGC, even if there exists a degree $d$ PTF that is consistent with a $1-\epsilon$ fraction of the examples, there is no efficient proper agnostic learning algorithm that can output a degree $d$ PTF correctly labelling more than a $\frac{1}{2}+\epsilon$ fraction of the examples; ii) Assuming P ≠ NP, even if there exists a halfspace that is consistent with a $1-\epsilon$ fraction of the examples, there is no efficient agnostic learning algorithm that can find a degree 2 PTF that correctly labels more than a $\frac{1}{2}+\epsilon$ fraction of the examples.

Admittedly, our results do not rule out the possibility of efficient learning algorithms when $\epsilon$ is sub-constant or when unrestricted hypotheses may be used.

10.1.3  Overview  of  the  Proof 

Based on the idea of constructing a Dictator Test for PTFs, we now give an overview of the proofs of Theorems 10.1.2 and 10.1.4. In comparison with the Dictator Test constructed in Section 2.6.2, the additional complication in proving Theorem 10.1.2, which addresses the hardness of proper learning of degree $d$ PTFs, is to handle the cross terms (such as $x_u^i x_v^j$) in degree $d$ PTFs. It is easy to see that for the Dictator Test $\mathcal{T}_1$, there exists a degree 3 polynomial $f_e = (x_u^i - x_v^i) \sum_i (x_u^i)^2$ that would pass the test with high probability. However, $f_v = 0$, which gives no clue for deciding the label of $v$. The main innovation of our proof is to design a proper Dictator Test that lets $f_e = x_u^i - (x_v^i)^d$ pass with high probability. More specifically, we modify the test $\mathcal{T}_1$ by setting $y = (a_1h_1 + g_1^d + b\delta,\ a_2h_2 + g_2^d + b\delta,\ \ldots,\ a_nh_n + g_n^d + b\delta,\ g_1, \ldots, g_n)$ and checking $\mathrm{sgn}(f(y)) = b$. A nice property of such a test is that it forces $f_e$ to have almost no weight on the cross terms. The complete proof of the Dictator Test as well as Theorem 10.1.2 appears in Section 10.2.

As for Theorem 10.1.4, a first observation is that the given test $\mathcal{T}_1$ already has soundness $3/4 + \epsilon$ for degree 2 PTFs. To see this, notice that $r$ and $-r$ are generated with equal probability, so essentially we are testing the following 4-tuple of inequalities with equal probability:

\[ f_e(r + \delta u) > 0; \quad f_e(r - \delta u) < 0; \quad f_e(-r + \delta u) > 0; \quad f_e(-r - \delta u) < 0. \]

Recall that $f_e(x)$ is a degree 2 polynomial; we can write it as the sum $\theta + f_1(x) + f_2(x)$, where $f_1(x)$ is its linear (degree 1) part and $f_2(x)$ is its quadratic (degree 2) part.

If all of the above 4 inequalities hold, then combining $f_e(r + \delta u) > 0$ and $f_e(-r - \delta u) < 0$ we get $f_1(r + \delta u) > 0$, since $f_e(x) - f_e(-x) = 2f_1(x)$; and combining $f_e(r - \delta u) < 0$ and $f_e(-r + \delta u) > 0$ we get $f_1(r - \delta u) < 0$. Therefore, if some degree 2 polynomial function $f$ passes the test with probability $3/4 + \epsilon$, then by an averaging argument, for an $\epsilon$ fraction of the 4-tuples all four inequalities hold, which implies that for an $\epsilon$ fraction of the $r$ generated, $f_1(r + \delta u) > 0$ and $f_1(r - \delta u) < 0$. Then we know that the linear function $f_1$ passes the Dictator Test $\mathcal{T}_1$ with probability above $1/2 + \epsilon$. This essentially reduces to the problem of testing degree 1 PTFs, which we already know how to analyze.
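The step that extracts the linear part uses only the identity f(x) - f(-x) = 2 f_1(x) for a degree 2 polynomial f = θ + f_1 + f_2, because the constant and quadratic parts are even. The sketch below verifies this on a small integer example; the particular coefficients and helper names are arbitrary illustrations.

```python
def quadratic(theta, lin, quad):
    """Return f(x) = theta + sum_i lin[i] x_i + sum_{i,j} quad[i][j] x_i x_j."""
    def f(x):
        n = len(x)
        linear_part = sum(lin[i] * x[i] for i in range(n))
        quad_part = sum(quad[i][j] * x[i] * x[j]
                        for i in range(n) for j in range(n))
        return theta + linear_part + quad_part
    return f

def odd_part(f, x):
    """(f(x) - f(-x)) / 2, which equals the linear part f_1(x) when deg f <= 2."""
    neg = [-xi for xi in x]
    return (f(x) - f(neg)) / 2

lin = [1, -2]
f = quadratic(5, lin, [[3, 1], [0, -4]])
x = [2, 3]
linear_value = sum(c * xi for c, xi in zip(lin, x))   # f_1(x) = 2 - 6 = -4
```

Consequently, whenever f(p) > 0 and f(-p) < 0 for a point p, the linear part satisfies f_1(p) > 0, which is exactly how the pairs of inequalities are combined above.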

To further get the soundness down to 1/2, more work has to be done. Roughly speaking, we check $\mathrm{sgn}(f(k_1 r + k_2 \delta u)) = \mathrm{sgn}(k_2)$ for $k_1, k_2$ generated from some carefully constructed distribution.

In addition to the above modification, in order to remove the need to assume the UGC, we use the “folding trick” proposed in [60, 106] to ensure consistency across different vertices. This has the benefit that we only need to design a test on one vertex (instead of an edge). The reason that we cannot use “folding” for our first result on low-degree PTFs is that such a folding cannot handle the cross terms. The complete proof of Theorem 10.1.4 appears in Section 10.3.


10.2  On  Hardness  of  Proper  Learning  Degree  d  PTFs 

In  this  section,  we  will  prove  Theorem  10.1.2. 

10.2.1  Dictator  Test 

As mentioned, a key gadget in the hardness reduction is a Dictator Test of whether a degree $d$ polynomial threshold function $f : \mathbb{R}^{2n} \to \{-1, 1\}$ is of the form

\[ \mathrm{sgn}(x_i - x_{n+i}^d) \]

for some $i \in [n]$.

For any function

\[ f(x) = \sum_{\text{multiset } S \subseteq [2n],\ |S| \le d} c_S \prod_{i \in S} x_i \]

where $x \in \mathbb{R}^{2n}$, our Dictator Test will query its value on a single point $y$ and decide to accept or reject based on $\mathrm{sgn}(f(y))$. For notational convenience, all future appearances of $S$ refer to a multiset unless stated otherwise.

The following is the definition of the test.

Definition 10.2.1. Fixing parameters $\beta = \frac{1}{\log n}$ and $\delta = 2^{-n^2}$, we generate one randomized query with the following procedure:

Dictator Test $\mathcal{T}_d$

1. Generate independent $\beta$-biased bits $a_1, a_2, \ldots, a_n \in \{0, 1\}$ (i.e., $a_i = 1$ with probability $\beta$ and $0$ with probability $1 - \beta$).

2. Generate $2n$ independent unit Gaussian variables: $h_1, \ldots, h_n, g_1, \ldots, g_n$.

3. Generate a random bit $b \in \{-1, 1\}$.

4. Set $y = (a_1h_1 + g_1^d + b\delta,\ a_2h_2 + g_2^d + b\delta,\ \ldots,\ a_nh_n + g_n^d + b\delta,\ g_1, \ldots, g_n)$.

5. Accept if $\mathrm{sgn}(f(y)) = b$.

Now we state the completeness and soundness properties of $\mathcal{T}_d$.

Lemma 10.2.2. (Completeness) When $f(x) = x_i - x_{n+i}^d$, it passes the test with probability at least $1 - \beta$.

Proof. We know that

\[ f(y) = a_ih_i + g_i^d + b\delta - g_i^d = a_ih_i + b\delta. \]

Therefore, when $a_i = 0$ (which happens with probability $1 - \beta$), $f(y) = b\delta$ has the same sign as $b$. □
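The query generation and the completeness argument can be simulated as follows. This is a sketch under stated assumptions: β and δ are passed as explicit parameters (the toy values below are not the ones used in the proof), exact rational arithmetic via `fractions.Fraction` is used so that the tiny offset bδ is not lost to floating-point rounding, and all function names are illustrative.

```python
import random
from fractions import Fraction

def dictator_query(n, d, beta, delta, rng):
    """Generate one query point y in R^{2n}, the bit b, and the bits a."""
    a = [1 if rng.random() < beta else 0 for _ in range(n)]     # step 1
    h = [Fraction(rng.gauss(0.0, 1.0)) for _ in range(n)]       # step 2
    g = [Fraction(rng.gauss(0.0, 1.0)) for _ in range(n)]
    b = rng.choice((-1, 1))                                     # step 3
    first_half = [a[i] * h[i] + g[i] ** d + b * delta for i in range(n)]
    return first_half + g, b, a                                 # step 4

def dictator(i, n, d):
    """The polynomial x_i - x_{n+i}^d (0-based index i)."""
    return lambda y: y[i] - y[n + i] ** d

def sgn(t):
    return 1 if t > 0 else -1

# Assumed toy parameters, chosen only to keep the simulation fast.
n, d = 6, 3
beta, delta = 0.25, Fraction(1, 2 ** 20)
rng = random.Random(1)
f = dictator(0, n, d)
results = []
for _ in range(500):
    y, b, a = dictator_query(n, d, beta, delta, rng)
    results.append((a[0], sgn(f(y)) == b))                      # step 5
pass_rate = sum(ok for _, ok in results) / len(results)
```

Every query with a_0 = 0 is accepted, as in the proof of Lemma 10.2.2; queries with a_0 = 1 are accepted roughly half of the time, so the overall pass rate is about 1 - β/2.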

The more complicated part is the following soundness guarantee. To state it, we first introduce the following notion:

Definition 10.2.3. For any degree $d$ polynomial function $f : \mathbb{R}^n \to \mathbb{R}$, we define

\[ \mathrm{wt}(f) = \sum_{1 \le |S| \le d} |c_S|. \]

We also define $I_\theta(f)$ to be $\left\{ i \mid i \in S,\ |c_S| \ge \theta \cdot \frac{\mathrm{wt}(f)}{\binom{n+d}{d}} \right\}$.

By the above definition, when $\theta \le 1$, $I_\theta(f)$ is not empty, as the total number of multisets of size at most $d$ is $\binom{n+d}{d}$.
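The two quantities in Definition 10.2.3 can be computed directly from the coefficient table. The sketch below assumes the threshold θ · wt(f) / C(n+d, d) for membership in I_θ(f) and a dictionary from multisets (sorted tuples of 0-based indices) to coefficients; both the representation and the function names are illustrative.

```python
from math import comb

def wt(coeffs):
    """Sum of |c_S| over all non-empty multisets S (the constant term is excluded)."""
    return sum(abs(c) for S, c in coeffs.items() if len(S) >= 1)

def influential(coeffs, theta, n, d):
    """I_theta(f): indices lying in some S with |c_S| >= theta * wt(f) / C(n+d, d)."""
    threshold = theta * wt(coeffs) / comb(n + d, d)
    return {i for S, c in coeffs.items()
            if len(S) >= 1 and abs(c) >= threshold
            for i in S}

# Assumed toy polynomial on n = 3 variables of degree d = 2, so C(n+d, d) = 10.
n, d = 3, 2
coeffs = {(): 7.0, (0,): 0.5, (1, 2): 0.45, (2, 2): 0.05}
weight = wt(coeffs)                        # 0.5 + 0.45 + 0.05
small = influential(coeffs, 1.0, n, d)     # threshold about 0.1
large = influential(coeffs, 4.6, n, d)     # threshold about 0.46
```

Raising θ can only shrink the set, and for θ ≤ 1 the set is non-empty because some coefficient must be at least the average wt(f) / C(n+d, d).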

Lemma 10.2.4. (Soundness) For $d$ being a constant and all $n$ big enough, if some degree $d$ polynomial function $f(x)$ passes the test with probability $\frac{1}{2} + \beta$, then for $f_1 = f(x_1, \ldots, x_n, 0, \ldots, 0)$ and $f_2 = f(0, 0, \ldots, 0, x_{n+1}, \ldots, x_{2n})$, we have $|I_{0.5}(f_1)| \le \frac{1}{\beta^2}$ and $|I_1(f_2)| \le \frac{1}{\beta^2}$. In addition, if for some $i \in [n]$, $(n+i) \in I_1(f_2)$, we must also have $i \in I_{0.5}(f_1)$.

Proof. If $\mathrm{wt}(f) = 0$, which means $f(x)$ is a constant function, it passes the test with probability $\frac{1}{2}$.

Otherwise, as the Dictator Test only checks the sign of $f$ at some point, without loss of generality we can assume that $\mathrm{wt}(f) = 1$.

Set $r = (a_1h_1 + g_1^d,\ a_2h_2 + g_2^d,\ \ldots,\ a_nh_n + g_n^d,\ g_1, g_2, \ldots, g_n)$ and $u = (1, 1, \ldots, 1, 0, 0, \ldots, 0) \in \mathbb{R}^{2n}$ such that $u_i = 1$ when $1 \le i \le n$ and $u_i = 0$ when $n+1 \le i \le 2n$. Then the Dictator Test defined is essentially the following:

• Generate $r$;

• Test $f(r + \delta u) > 0$ with probability $\frac{1}{2}$ and $f(r - \delta u) < 0$ with probability $\frac{1}{2}$.

Suppose that some function $f(x)$ passes with probability $\frac{1}{2} + \beta$; then we know that for at least a $2\beta$ fraction of the $r$ generated, we must have both of the following hold:

\[ f(r + \delta u) > 0; \tag{10.1} \]

\[ f(r - \delta u) < 0. \tag{10.2} \]
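Viewing $d$ as a constant, let us first bound the difference between $f(r + \delta u)$ and $f(r)$:

\[ f(r + \delta u) - f(r) = \sum_{|S| \le d} c_S \left( \prod_{i \in S,\, i \in \{1, \ldots, n\}} (r_i + \delta) \prod_{i' \in S,\, i' \in \{n+1, \ldots, 2n\}} r_{i'} - \prod_{i \in S} r_i \right) \]
\[ {} \le \sum_{1 \le |S| \le d} |c_S| \sum_{T \ne \emptyset,\, T \subseteq (S \cap [n])} \delta^{|T|} \prod_{i \in S \setminus T} |r_i| \le \sum_{1 \le |S| \le d} |c_S|\, 2^{|S|} \left( \delta \prod_{|r_i| \ge 1,\, i \in S} |r_i| \right). \tag{10.3} \]

By the property of Gaussian random variables, we know that $E[|r_i|] \le E[|g_i^d|] + E[|h_i|] \le d^d$. Then by Markov's inequality, $\Pr[|r_i| \ge 2d^dn^2] \le \frac{1}{2n^2}$. By a union bound over the $2n$ coordinates, for all but a $\frac{1}{n}$ fraction of the $r$ generated, we have that $\max_i |r_i| \le 2d^dn^2$.

Given that $\max_i |r_i| \le 2d^dn^2$, we can further bound (10.3) by

\[ \delta\, 2^d (2d^dn^2)^d \sum_{1 \le |S| \le d} |c_S| \le \delta\, 4^d d^{d^2} n^{2d} \le \frac{1}{2^n} \quad \text{(for $n$ large enough and constant $d$)}. \]

A similar calculation shows that when $\max_i |r_i| \le 2d^dn^2$,

\[ f(r) - f(r - \delta u) \le \frac{1}{2^n}. \]

Therefore, for at least a $2\beta - \frac{1}{n} \ge \frac{1}{\log n}$ fraction of the $r$ generated, we have that

\[ |f(r)| \le \frac{1}{2^n}. \]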

Recall that $f(r) = f(a_1h_1 + g_1^d, \ldots, a_nh_n + g_n^d, g_1, \ldots, g_n)$; for every realization of $a \in \{0,1\}^n$, we denote the corresponding restriction of $f$ by $f_a(g, h)$, which is a degree $d^2$ polynomial on the Gaussian random variables $h_1, \ldots, h_n, g_1, \ldots, g_n$. We use $\|f_a\|_2$ to denote

\[ E[f_a(g_1, \ldots, g_n, h_1, \ldots, h_n)^2]^{1/2}. \]

Then we know that

\[ \frac{1}{\log n} \le \Pr_{a,g,h}\left( |f_a(g,h)| \le 1/2^n \right) \le E_a\left[ \left( \frac{1/2^n}{\|f_a\|_2} \right)^{1/d^2} \right], \tag{10.4} \]

where the last inequality is due to the small ball property of Gaussian polynomials (see Lemma 10.4.1).

Suppose $a' = \arg\min_a \|f_a\|_2$. Then (10.4) implies that

\[ \left( \frac{1}{2^n \cdot \|f_{a'}\|_2} \right)^{1/d^2} \ge \frac{1}{\log n}, \]

or equivalently

\[ \|f_{a'}\|_2 \le \frac{(\log n)^{d^2}}{2^n}. \tag{10.5} \]

Let us write down $f_{a'}$ as a polynomial on $g_1, \ldots, g_n, h_1, \ldots, h_n$, say

\[ f_{a'} = \sum_{\text{multisets } T, T'} w_{T,T'} \prod_{i \in T} g_i \prod_{i \in T'} h_i. \]

Let us further simplify the notation of $w_{T,\emptyset}$ by $w_T$. Then we claim that every

\[ |w_T| \le \frac{1}{n^{10d}}. \]
Otherwise,  by  Lemma  10.4.2, 

11  ^  l>2  ^  n10d  (2n+2d2)(d2)d2 ' 

which  asymptotically  violates  (10.5)  for  large  enough  n. 

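Knowing that each of the $w_T$ is small, let us now establish its relationship to the original coefficients $c_S$ (multiset $S \subseteq [2n]$).

It is easy to see that the restriction of $f_{a'}$ obtained by setting all the $h_i$ to 0 is the same as $f(g_1^d, \ldots, g_n^d, g_1, \ldots, g_n)$, which implies that

\[ \sum_{T \subseteq [n]} w_T \prod_{i \in T} g_i = \sum_{S \subseteq [2n]} c_S \prod_{i \in S,\, i \in [n]} g_i^d \prod_{n+i \in S} g_i. \]

Summing all the $c_S$ such that $S$ corresponds to the same term

\[ \prod_{i \in S,\, i \in [n]} g_i^d \prod_{n+i \in S} g_i, \]

we get $w_T$ for $T = \{i : d \mid i \in S\} \cup \{i \mid n+i \in S\}$, where $i : d$ denotes $d$ copies of $i$.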

The following lemma illustrates when different $c_S$ can correspond to the same term.

Lemma 10.2.5. For any multisets $S_0, S_1 \subseteq [2n]$ of size at most $d$ with $S_0 \ne S_1$, if

\[ \prod_{i \in S_0,\, 1 \le i \le n} g_i^d \prod_{n+j \in S_0,\, 1 \le j \le n} g_j = \prod_{i \in S_1,\, 1 \le i \le n} g_i^d \prod_{n+j \in S_1,\, 1 \le j \le n} g_j, \tag{10.6} \]

then there exists some $i$ such that $S_0 = \{i\}$ and $S_1 = \{n+i : d\}$, or vice versa.

Proof. (Proof of Lemma 10.2.5) Let us discuss the following two cases.

• $S_0 \cap [n] \ne S_1 \cap [n]$. Without loss of generality, let us assume that there is some $i \in S_0$ with $i \notin S_1$. Then to make (10.6) hold, it must be the case that $S_1$ contains $d$ copies of $n+i$. Also, since $|S_1| \le d$, it can only be the case that $S_1 = \{n+i : d\}$. Therefore $S_0$ must be $\{i\}$.

• $S_0 \cap \{n+1, \ldots, 2n\} \ne S_1 \cap \{n+1, \ldots, 2n\}$. Without loss of generality, let us assume that there is some $(n+i) \in S_0$ with $(n+i) \notin S_1$. Then it must be the case that $i \in S_1$ to make (10.6) hold. Then the power on $g_i$ is $d$. This can only happen when $S_0 = \{n+i : d\}$, which forces $S_1$ to be $\{i\}$.

□

By Lemma 10.2.5, we have the following relationship between $c_S$ and $w_T$.

• For any $i \in [n]$ and $S_1 = \{i\}$, $S_2 = \{n+i : d\}$, $T = \{i : d\}$, we have $c_{S_1} + c_{S_2} = w_T$.

• If $S \ne \{i\}$ and $S \ne \{n+i : d\}$ for every $i \in [n]$, then for $T = \{i : d \mid i \in S\} \cup \{i \mid n+i \in S\}$, we have that $w_T = c_S$.

Recall that

\[ f_1(x) = f(x_1, \ldots, x_n, 0, \ldots, 0) = \sum_{S \subseteq [n]} c_S \prod_{i \in S} x_i \]

and

\[ f_2(x) = f(0, 0, \ldots, 0, x_{n+1}, \ldots, x_{2n}) = \sum_{S \subseteq \{n+1, \ldots, 2n\}} c_S \prod_{i \in S} x_i. \]

We can write $f = f_1 + f_2 + f_{12}$ where

\[ f_{12} = \sum_{|S| \le d,\ S \cap \{1, 2, \ldots, n\} \ne \emptyset,\ S \cap \{n+1, n+2, \ldots, 2n\} \ne \emptyset} c_S \prod_{i \in S} x_i. \]

We know then

• for every $c_S$ appearing in $f_1$ such that $|S| \ge 2$,

\[ |c_S| \le \frac{1}{n^{10d}}; \tag{10.7} \]

• for every $c_S$ in $f_2$ such that $S$ is not the multiset $\{n+i : d\}$ for some $i \in [n]$,

\[ |c_S| \le \frac{1}{n^{10d}}; \tag{10.8} \]

• for every $i \in [n]$,

\[ \left| |c_{\{i\}}| - |c_{\{n+i:d\}}| \right| \le |c_{\{i\}} + c_{\{n+i:d\}}| \le \frac{1}{n^{10d}}; \tag{10.9} \]

• for every $c_S$ appearing in $f_{12}$,

\[ |c_S| \le \frac{1}{n^{10d}}. \tag{10.10} \]

Since  for  f\  and  f^,  their  coefficients  are  either  matching  (such  as  qp  and  C{n+;:q)  or 
being  small  themselves,  we  have  that 


|wt(/i)- wt(/2)l  ^  0(- 


n 


10  d 


a  n 


(10.11) 


Also by (10.10), as every coefficient in $f_{12}$ is at most $\frac{1}{n^{10d}}$ in absolute value and there are at most $\binom{2n+d}{d}$ of them, we know then

\[ \mathrm{wt}(f_{12}) \le \frac{1}{n^{10d}} \binom{2n+d}{d} \le \frac{1}{n}. \]

Therefore, recalling that $\mathrm{wt}(f_1) + \mathrm{wt}(f_2) + \mathrm{wt}(f_{12}) = \mathrm{wt}(f) = 1$, we have

\[ \mathrm{wt}(f_1) + \mathrm{wt}(f_2) \ge 1 - \frac{1}{n}. \tag{10.12} \]

Combining (10.11) and (10.12), we know that for $n \ge 10$,

\[ 0.51 \ge \mathrm{wt}(f_1), \mathrm{wt}(f_2) \ge 0.49. \]


Therefore,  every  element  ( n  +  i)  in  Ii(f2),  it  must  come  from  those  sets  S  such  that  eg  ^ 
0.49/('^d).  By  (10.8),  we  know  it  can  only  be  the  set  S  =  {n  +  i :  d }  as  all  the  other  eg  are  less 
than  By  (10.9),  we  know  that  /({(})  ^  and  it  must  be  in  /0.5(A)  as  wt(A)  ^  0.51. 

By  above  proof,  we  also  know  IA(A)I  ^  |/o.5(A)l  as  any  (n  +  i)  e  A(A)  implies  that 
i  e  /0.5(A)-  It  remains  to  bound  the  size  of  /0.5(A)  by  ^2- 

Let  us  prove  it  by  contradiction.  Suppose  that  /0.5(A)  ^  As  wt(A)  ^  0.49,  every 
j  e  /0.5(A)  comes  from  the  set  S  =  {7}  as  all  the  other  eg  is  less  than  Then  when 

we  consider  all  the  possible  realization  of  a,  with  probability  1  -  (1  -  p)d(fi)\ I  ^  1  _  I  there 
exists  some  i  e  /0.5(A)  with  at  =  1.  By  the  definition  of  /0.5(A))  we  also  must  have 


0.5-0.49 

^  rn+d  ^ 


tn+a\ 

1  d  J 


0.2 


Then  there  will  be  a  term  qpA  in  the  expansion  of  fa  as  a  Gaussian  polynomial  of 
g  and  h  such  that  |cp}|  ^  0  2/(^+dj  This  suggests  that  for  (1  -  ^)  of  the  realization  of  a, 

(  d  )  (  d2  )(rf2)d  n 

Then  we  have 

1  1  n2d 

; - <  Prffl!^(l  faigM  <  1/2")  <  -  +  0(— - ) 

logn  n  2n/d 

which  leads  to  a  contradiction  for  big  enough  n. 

□ 


10.2.2  Hardness  Reduction  from  Unique-Games 


With the above Dictator Test, we now prove Theorem 10.1.2. The hardness reduction is from a Unique-Games instance $\mathcal{L}(U, V, E, \Pi, k)$ to a distribution of positive and negative examples. The examples in the learning problem lie in the space $\mathbb{R}^{(|U|+|V|)k}$, labeled with either positive (+1) or negative (-1). Denote $\dim = (|U| + |V|)k$. For $y \in \mathbb{R}^{\dim}$, each coordinate is indexed by a possible label for a vertex in $U \cup V$.

We fix the following notation: for $w \in U \cup V$ and $x \in \mathbb{R}^{\dim}$, we use $x_w^i$ to denote the coordinate corresponding to the vertex $w$'s $i$-th label. Also, we use $x_w$ to indicate the collection of coordinates corresponding to vertex $w$; i.e., $(x_w^1, x_w^2, \ldots, x_w^k)$. Also, for any function $f(x) : \mathbb{R}^{\dim} \to \mathbb{R}$, we use $f_u$ to denote $f$'s restriction obtained by setting all the coordinates to 0 except $x_u$. Similarly, denote by $f_{u,v}$ the restriction of $f$ obtained by setting all the coordinates to 0 except $x_u, x_v$.

We construct the example distribution from the Unique-Games instance by the following procedure. Let us choose parameters $\beta = \frac{1}{\log(k)}$ and $\delta = 2^{-k^2}$.

Reduction from Unique-Games

1. Randomly choose an edge $e = (u, v)$ for $u \in U$ and $v \in V$.

2. Set $y_w = 0$ for any $w \in U \cup V$ such that $w \ne u$, $w \ne v$.

3. Generate independent $\beta$-biased bits $a_1, a_2, \ldots, a_k \in \{0, 1\}$ (i.e., $a_i = 1$ with probability $\beta$ and $0$ with probability $1 - \beta$) and generate $2k$ Gaussians $h_1, \ldots, h_k, g_1, \ldots, g_k$.

4. Generate a random bit $b \in \{-1, 1\}$.

5. For every $i \in [k]$, set $y_v^i = g_i$.

6. For every $i \in [k]$, set $y_u^i = a_ih_i + (g_{\pi_e(i)})^d + \delta b$.

7. Output the example-label pair $(y, b)$.

We  will  prove  the  following  two  Lemmas  (10.2.6  and  10.2.7)  for  the  reduction. 

Lemma  10.2.6.  (Completeness)  If  Opt(if)  =  1  -rj,  then  there  is  a  degree  d  polynomial 
threshold  function  that  is  consistent  with  1  -  tj  -  (5  percentage  of  the  examples. 

Proof.  (Proof  of  Lemma  10.2.6)  Suppose  that  there  is  a  labeling  l  that  satisfies  1  -rj  edges. 
Then  consider  the  following  degree  d  polynomial  threshold  functions: 

sgn(£ 

ueU  veV 

It  is  easy  to  verify  that  such  a  PTF  agrees  with  1  —  77  —  /3  fraction  of  the  examples.  O 

Lemma  10.2.7.  (Soundness)  If  Opt(«Sf )  ^  l/k&^\  then  there  is  no  degree  d  polynomial 
threshold  function  agrees  with  more  than  \  +  2/3  fraction  of  the  examples. 

Proof. (Proof of Lemma 10.2.7) We prove the lemma by contradiction. Suppose that there is some degree $d$ polynomial function $f$ that passes the test with probability $\frac{1}{2} + 2\beta$. Then by an averaging argument, for a $\beta$ fraction of the edges $(u, v)$ picked in the first step, $f$ passes the test with probability at least $\frac{1}{2} + \beta$. Let us call these edges “good”.

For a particular “good” edge $e = (u, v)$, let us assume that $\pi_e$ is the identity mapping for notational convenience.

Essentially, we are conducting our test $\mathcal{T}_d$ on $f_{u,v}$ with parameter $n = k$.

Since $f_{u,v}$ passes the test with probability $\frac{1}{2} + \beta$, by Lemma 10.2.7 we must have $I_{0.5}(f_u) \cap I_1(f_v) \neq \emptyset$ (if we index $x_u^i$ by $i$ when outputting $I_{0.5}(f_u)$ and index $x_v^i$ by $i$ when outputting $I_{0.5}(f_v)$). In addition, we have $|I_1(f_v)|, |I_{0.5}(f_u)| \le 1/\beta^2$.

We now give the following labelling strategy based on $f$. For every $u \in U$, we randomly pick its label from $I_{0.5}(f_u)$, and for every $v \in V$, we randomly pick its label from $I_1(f_v)$. Then each good edge is satisfied with probability $\beta^2$. Overall, such a labelling strategy satisfies at least a $\beta^3 = \left(\frac{1}{\log k}\right)^3$ fraction of the edges in expectation. This contradicts the assumed upper bound on $\mathrm{Opt}(\mathcal{L})$ for sufficiently large $k$. □

By the above proof, we gave a way of constructing a distribution of example-label pairs from an instance $\mathcal{L}$ of Label Cover. Let us use $\mathrm{Opt}(\mathcal{D})$ to denote the accuracy of the best degree $d$ PTF on $\mathcal{D}$. Our constructed distribution has the following properties:

•  If $\mathrm{Opt}(\mathcal{L}) = 1 - \eta$, then $\mathrm{Opt}(\mathcal{D}) \ge 1 - \eta - \beta$.

•  If $\mathrm{Opt}(\mathcal{L})$ satisfies the soundness condition of Lemma 10.2.7, then $\mathrm{Opt}(\mathcal{D}) < \frac{1}{2} + 2\beta$.


10.2.3  Discretizing  the  Gaussian  Distribution 


The above reduction does not run in polynomial time, as the resulting distribution $\mathcal{D}$ has infinite support. If we look into the construction, for every edge picked we need to generate $2k$ independent Gaussians $h = (h_1, \ldots, h_k)$, $g = (g_1, \ldots, g_k)$.

To “discretize” the reduction, we replace $h$ and $g$ by some $h'$ and $g'$, where each $h'_i$ and $g'_i$ is independently generated as the sum of $N$ random bits divided by $\sqrt{N}$, with $N = (2k)^{24d^2}$.

By Theorem 10.6.2, there exists a way of coupling $(g, h)$ with $(g', h')$ such that every degree $d^2$ polynomial has the same sign on $(g, h)$ as on $(g', h')$, except for a $1/k$ fraction of the $(g, h)$ generated. Therefore, if we replace $(g, h)$ with $(g', h')$ in the reduction, and also notice that for every realization of $a$ the resulting polynomial on $g_1, \ldots, g_k$ and $h_1, h_2, \ldots, h_k$ is of degree at most $d^2$, our discretized reduction almost preserves the soundness and completeness guarantees, with a loss of $1/k$:

•  If $\mathrm{Opt}(\mathcal{L}) = 1 - \eta$, then $\mathrm{Opt}(\mathcal{D}) \ge 1 - \eta - \beta - 1/k$.

•  If $\mathrm{Opt}(\mathcal{L})$ satisfies the soundness condition of Lemma 10.2.7, then $\mathrm{Opt}(\mathcal{D}) < \frac{1}{2} + 2\beta + 1/k$.

Also notice that the distribution of $(g', h')$ has a support of size $2^{2kN} = 2^{2k(2k)^{24d^2}}$, which is a constant since the label size is regarded as constant for Unique-Games. Then we can simply enumerate its whole support to remove the need for random bits and make the reduction deterministic.

Eventually, by picking proper $\eta$ and $k$ (e.g., $\eta = \epsilon/2$ and $k = e^{1/\epsilon^2}$), we prove Theorem 10.1.2.


10.2.4  For  d  being  Super-constant 

From the above proof, our hardness result holds for any constant $d$. In fact, it is easy to see that the proof still works for $d = o(\log\log n)$ (we need $2^{2k(2k)^{24d^2}}$ to be polynomial in the size of the label cover instance).

10.3  Hardness  of  Learning  Halfspaces  with  degree  2 
PTFs 

In  this  section,  we  prove  Theorem  10.1.4.  Again  the  proof  has  two  parts.  In  the  first  step, 
we  construct  a  dictator  test  for  degree  2  PTFs.  In  the  second  step,  we  compose  such  a 
dictator  test  with  the  Label  Cover  problem  to  prove  NP-hardness  result. 




10.3.1  The  Dictator  Test 


The key gadget in the hardness reduction is a Dictator Test of whether a degree 2 polynomial threshold function $f : \mathbb{R}^n \to \{-1, 1\}$ is of the form $\mathrm{sgn}(x_i)$ for some $i \in [n]$.

Suppose $p$ is a degree 2 polynomial function written in the following form:

$$p(x) = \theta + \sum_{i \in [n]} c_i x_i + \sum_{1 \le i \le j \le n} c_{ij} x_i x_j.$$

We also write $p_1(x) = \sum c_i x_i$ and $p_2(x) = \sum c_{ij} x_i x_j$ to denote the degree 1 and degree 2 parts of $p(x)$.

Below  is  a  one  query  Dictator  Test  3\  on  sgn(p(x)).  We  choose  parameter  /3  =  t  ^  and 
8  =  ^  for  the  test. 

Test $\mathcal{T}_2$

1. Generate independent $\beta$-biased bits $a_1, a_2, \ldots, a_n \in \{0, 1\}$ (i.e., $a_i = 1$ with probability $\beta$ and $0$ with probability $1 - \beta$) and generate $n$ independent Gaussian variables $g_1, \ldots, g_n$. Set $r = (a_1 g_1, a_2 g_2, \ldots, a_n g_n)$.

2. Generate $t$ by picking a number $i \in \{1, 2, \ldots, (\log n)^2\}$ uniformly at random and setting $t = n^i$.

3. Generate a random bit $b \in \{-1, 1\}$.

4. Set $u \in \mathbb{R}^n$ to be the all-ones vector $(1, 1, 1, \ldots, 1)$ and set $y = t^3 r + b t^2 \delta u$.

5. Accept if $\mathrm{sgn}(p(y)) = b$.
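The steps of the test can be sketched in code as follows. This is only an illustrative sketch: the function names `sample_query`, `accepts` and `dictator` are hypothetical, the value $\beta = 1/\log n$ (natural logarithm) is an assumption taken from how $\beta$ is used in the later analysis, and $\delta$ is left as an input rather than fixed.

```python
import math
import random


def sample_query(n, delta, rng):
    """Draw one query (y, b) of the test; also return the bias bits a for inspection.

    Assumes n >= 3 so that (log n)^2 >= 1, and beta = 1 / log(n).
    """
    beta = 1.0 / math.log(n)
    a = [1 if rng.random() < beta else 0 for _ in range(n)]
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = [ai * gi for ai, gi in zip(a, g)]            # step 1: r = (a_1 g_1, ..., a_n g_n)
    i = rng.randint(1, int(math.log(n) ** 2))        # step 2: pick the exponent i
    t = n ** i                                       #         and set t = n^i
    b = rng.choice([-1, 1])                          # step 3: random sign
    y = [t**3 * ri + b * t**2 * delta for ri in r]   # step 4: y = t^3 r + b t^2 delta (1,...,1)
    return y, b, a


def accepts(p, y, b):
    """Step 5: accept iff sgn(p(y)) = b (sgn(0) is treated as -1 in this sketch)."""
    return (1 if p(y) > 0 else -1) == b


def dictator(i):
    """The polynomial p(x) = x_i."""
    return lambda y: y[i]
```

Whenever the bit $a_i$ comes out as zero, the $i$-th coordinate of $y$ equals $b t^2 \delta$, so `accepts(dictator(i), y, b)` holds for any positive `delta`.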


For the above test $\mathcal{T}_2$, we have the following completeness and soundness properties.

Lemma 10.3.1. (Completeness) If $p(x) = x_i$ for some $i \in \{1, \ldots, n\}$, then it passes with probability at least $1 - \beta$.

Proof. If $p(x) = x_i$ ($i \in [n]$), then as long as $a_i$ is set to zero in step 1, $p(y) = b t^2 \delta$ and it passes the test. By definition of the test, this happens with probability $1 - \beta$. □

Lemma 10.3.2. (Soundness) Denote $A = \sum_i c_i$ and let $I(p)$ be the set $\{i \mid c_i > A/n^2\}$. If some $p$ passes the test with probability $\frac{1}{2} + \beta$, then $|I(p)| \le 1/\beta^2$ and $A > 0$.

Proof. The proof is by contradiction. Suppose some function $p$ with $|I(p)| > 1/\beta^2$ or $A \le 0$ passes the above test with probability $\frac{1}{2} + \beta$.

First we show the following lemma.

Lemma 10.3.3. $\Pr[p_1(r) \in (-\delta A, \delta A)] \le \frac{2}{n}$.

Proof. The inequality obviously holds when $A \le 0$. Otherwise, assume $A > 0$ and $|I(p)| \ge 1/\beta^2$. When generating the $a_i$ in step 1, with probability $1 - (1 - \beta)^{|I(p)|} \ge 1 - \frac{1}{n}$ at least one of the coordinates in $I(p)$ is set to a Gaussian (instead of zero). For this $(1 - \frac{1}{n})$ fraction of $r$, no matter which other coordinates are set to be Gaussian, $p_1(r)$ is a Gaussian variable with variance at least $A^2/n^4$ (as one of the weights is at least $A/n^2$). Using the anti-concentration of Gaussian variables (Lemma 10.4.1), we have that

2<5A  re3  1 

PrvIPi(r)  e  <-SA,,SA>K  ^  s:  -  s: 
By a union bound, for at most a $\frac{2}{n}$ fraction of the $r$, $p_1(r)$ is inside the interval $(-\delta A, \delta A)$. □

Notice that $r$ and $-r$ are generated with equal probability, so an equivalent test to $\mathcal{T}_2$ is to test the following 4 inequalities with equal probability for the $r, t$ generated:

\begin{align}
p(t^3 r + t^2 \delta a) &> 0 \tag{10.13} \\
p(t^3 r - t^2 \delta a) &< 0 \tag{10.14} \\
p(-t^3 r + t^2 \delta a) &> 0 \tag{10.15} \\
p(-t^3 r - t^2 \delta a) &< 0. \tag{10.16}
\end{align}

As $p(y)$ passes the test with probability $\frac{1}{2} + \beta$, using an averaging argument, for a $\beta/2$ fraction of the $r$, a $\frac{1}{2} + \beta/2$ fraction of the constraints containing $r$ are satisfied. From this $\beta/2$ fraction of $r$, let us remove the $r$ (of probability at most $2/n$) such that $p_1(r) \in (-\delta A, \delta A)$. Recalling that $\beta = \frac{1}{\log n}$, we know that at least a $\beta/4$ fraction of $r$ remains. We call these $r$ “good”.

Let us fix a good $r$. By an averaging argument again, for at least a $\beta/4$ fraction of the $t$ generated, 3 out of the 4 inequalities in the 4-tuple that contains $t$ and $r$ are satisfied. There are 4 different ways of choosing 3 out of the 4 constraints. Without loss of generality, let us assume that for a $\beta/16$ fraction of the $t$, the first three constraints are satisfied. That is:

\begin{align}
p(t^3 r + t^2 \delta a) &> 0 \tag{10.17} \\
p(t^3 r - t^2 \delta a) &< 0 \tag{10.18} \\
p(-t^3 r + t^2 \delta a) &> 0 \tag{10.19}
\end{align}

Let us call these $t$ “good” for the corresponding $r$, and define the set that contains all the “good” $t$ for a given “good” $r$ to be $T_r$. Since the possible choices of $t$ are $n^i$ for $i \in [(\log n)^2]$, we know $|T_r| \ge (\log n)^2 \cdot \beta/16 = \Omega(\log n)$.

Since $p(x)$ is a degree 2 polynomial, we can express $p(r + \delta a)$ (by Taylor expansion) as:

$$p(r + \delta a) = \theta + p_1(r) + p_2(r) + \delta \sum_i c_i + \delta^2 \sum_{i \le j} c_{ij} + \delta \sum_{i \le j} c_{ij}(r_i + r_j).$$
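The expansion can be checked with exact rational arithmetic. The following is a small sanity check rather than part of the proof; the helper names `poly` and `expanded` are hypothetical, and the coefficients are random integers chosen only for illustration.

```python
from fractions import Fraction
import random


def poly(theta, c, cc, x):
    """Evaluate p(x) = theta + sum_i c[i] x_i + sum_{i<=j} cc[i][j] x_i x_j."""
    n = len(x)
    linear = sum(c[i] * x[i] for i in range(n))
    quadratic = sum(cc[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))
    return theta + linear + quadratic


def expanded(theta, c, cc, r, delta):
    """Right-hand side of the expansion of p(r + delta * (1, ..., 1))."""
    n = len(r)
    pairs = [(i, j) for i in range(n) for j in range(i, n)]
    p1 = sum(c[i] * r[i] for i in range(n))
    p2 = sum(cc[i][j] * r[i] * r[j] for i, j in pairs)
    A = sum(c)                                   # sum of the linear coefficients
    B = sum(cc[i][j] for i, j in pairs)          # sum of the quadratic coefficients
    p2_prime = sum(cc[i][j] * (r[i] + r[j]) for i, j in pairs)
    return theta + p1 + p2 + delta * A + delta**2 * B + delta * p2_prime


rng = random.Random(1)
n = 5
theta = rng.randint(-9, 9)
c = [rng.randint(-9, 9) for _ in range(n)]
cc = [[rng.randint(-9, 9) for _ in range(n)] for _ in range(n)]
r = [Fraction(rng.randint(-9, 9), 3) for _ in range(n)]
delta = Fraction(1, 7)
shifted = [ri + delta for ri in r]
print(poly(theta, c, cc, shifted) == expanded(theta, c, cc, r, delta))
```

Because `Fraction` arithmetic is exact, the comparison tests the identity itself and is not affected by floating-point rounding.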

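Denote $B = \sum_{1 \le i \le j \le n} c_{ij}$ and $p'_2(x) = \sum_{1 \le i \le j \le n} c_{ij}(x_i + x_j)$. Writing $x$ for the fixed good $r$, we can rewrite (10.17), (10.18), (10.19) as:

\begin{align}
t^3 p_1(x) + t^2 \delta A + t^6 p_2(x) + t^5 \delta p'_2(x) + t^4 \delta^2 B + \theta &> 0 \tag{10.20} \\
t^3 p_1(x) - t^2 \delta A + t^6 p_2(x) - t^5 \delta p'_2(x) + t^4 \delta^2 B + \theta &< 0 \tag{10.21} \\
t^3 p_1(x) + t^2 \delta A - t^6 p_2(x) - t^5 \delta p'_2(x) - t^4 \delta^2 B - \theta &> 0 \tag{10.22}
\end{align}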

Notice that (10.20) and (10.22) are equivalent to

$$p_1(x) \ge -\delta A/t + \left| t^3 p_2(x) + \delta t^2 p'_2(x) + \delta^2 t B + \theta/t^3 \right|.$$
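One way to see this equivalence is sketched below. The shorthand $X$ is introduced only for this remark and is not used elsewhere in the text; the step uses $t > 0$.

```latex
\begin{align*}
X &:= t^6 p_2(x) + t^5 \delta\, p'_2(x) + t^4 \delta^2 B + \theta, \\
\text{(10.20)}:&\quad t^3 p_1(x) + t^2 \delta A + X > 0, \\
\text{(10.22)}:&\quad t^3 p_1(x) + t^2 \delta A - X > 0, \\
\text{hence}&\quad t^3 p_1(x) + t^2 \delta A > |X|
  \;\Longrightarrow\; p_1(x) > -\frac{\delta A}{t} + \frac{|X|}{t^3}, \\
\frac{|X|}{t^3} &= \left| t^3 p_2(x) + \delta t^2 p'_2(x) + \delta^2 t B + \frac{\theta}{t^3} \right| .
\end{align*}
```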




Since we already know that $p_1(x) \notin (-\delta A, \delta A)$ and $t \ge n$, therefore

$$p_1(x) \ge \delta A.$$

Also, we can rewrite (10.21) as

$$p_1(x) \le \delta A/t - \left( t^3 p_2(x) + \delta t^2 p'_2(x) - \delta^2 t B + \theta/t^3 \right).$$

Let us further simplify the notation by denoting $C = p_2(x)$, $D = \delta p'_2(x)$ and $E = \delta^2 B$. Then we rewrite the above constraints as follows:

$$p_1(x) \ge -\delta A/t + \left| t^3 C + t^2 D + tE + \theta/t^3 \right|$$

and

$$p_1(x) \le \delta A/t - \left( t^3 C + t^2 D - tE + \theta/t^3 \right).$$


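Notice that the above (upper and lower) bounds hold for any $t$ in $T_r$. Therefore, for any $t_1, t_2 \in T_r$,

$$\delta A/t_1 - \left( t_1^3 C + t_1^2 D - t_1 E + \theta/t_1^3 \right) \ge -\delta A/t_2 + \left| t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3 \right|,$$

which is equivalent to

$$-\left( t_1^3 C + t_1^2 D - t_1 E + \theta/t_1^3 \right) + \delta A \left( \frac{1}{t_1} + \frac{1}{t_2} \right) \ge \left| t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3 \right|. \tag{10.23}$$

Using the fact that $p_1(x) \ge \delta A$, we get $-\left( t_1^3 C + t_1^2 D - t_1 E + \theta/t_1^3 \right) \ge \left(1 - \frac{1}{t_1}\right) \delta A$. Combining this with (10.23), for any $t_1, t_2 \in T_r$ we have

$$-\left( t_1^3 C + t_1^2 D - t_1 E + \theta/t_1^3 \right) \left( 1 + \frac{\frac{1}{t_1} + \frac{1}{t_2}}{1 - \frac{1}{t_1}} \right) \ge \left| t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3 \right|.$$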


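By definition, $t_i \ge n$ for any $i$, so $\frac{\frac{1}{t_1} + \frac{1}{t_2}}{1 - \frac{1}{t_1}} \le 3/n$. Therefore, for any $t_1, t_2$ in $T_r$, the following inequality holds:

$$\frac{-\left( t_1^3 C + t_1^2 D - t_1 E + \theta/t_1^3 \right)}{\left| t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3 \right|} \ge \frac{1}{1 + \frac{\frac{1}{t_1} + \frac{1}{t_2}}{1 - \frac{1}{t_1}}} \ge 1 - 3/n. \tag{10.24}$$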


We know $|T_r| = \Omega(\beta (\log n)^2) = \Omega(\log n)$. Actually, we only need the fact that $|T_r| \ge 5$. Let us pick $t_0 < t_1 < t_2 < t_3 < t_4$ from $T_r$, and denote $G = -\left( t_1^3 C + t_1^2 D - t_1 E + \theta/t_1^3 \right)$. We know that

$$G \le t_1^3 |C| + t_1^2 |D| + t_1 |E| + |\theta|/t_1^3.$$

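Also, for $t_0, t_2, t_3, t_4$, we write:

\begin{align}
F_0 &= t_0^3 C + t_0^2 D + t_0 E + \theta/t_0^3; \tag{10.25} \\
F_2 &= t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3; \tag{10.26} \\
F_3 &= t_3^3 C + t_3^2 D + t_3 E + \theta/t_3^3; \tag{10.27} \\
F_4 &= t_4^3 C + t_4^2 D + t_4 E + \theta/t_4^3. \tag{10.28}
\end{align}

Denote $F = \max_{i = 0, 2, 3, 4} |F_i|$; by (10.24) we know

$$\frac{G}{F} \ge 1 - 3/n. \tag{10.29}$$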

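Viewing $C, D, E, \theta$ as unknown variables and solving the linear system consisting of equations (10.25), (10.26), (10.27), (10.28) using Cramer's rule, we have

$$C = \frac{\begin{vmatrix} F_0 & t_0^2 & t_0 & 1/t_0^3 \\ F_2 & t_2^2 & t_2 & 1/t_2^3 \\ F_3 & t_3^2 & t_3 & 1/t_3^3 \\ F_4 & t_4^2 & t_4 & 1/t_4^3 \end{vmatrix}}{\begin{vmatrix} t_0^3 & t_0^2 & t_0 & 1/t_0^3 \\ t_2^3 & t_2^2 & t_2 & 1/t_2^3 \\ t_3^3 & t_3^2 & t_3 & 1/t_3^3 \\ t_4^3 & t_4^2 & t_4 & 1/t_4^3 \end{vmatrix}}.$$

Notice that $t_0 < t_2 < t_3 < t_4$, so the denominator

$$\begin{vmatrix} t_0^3 & t_0^2 & t_0 & 1/t_0^3 \\ t_2^3 & t_2^2 & t_2 & 1/t_2^3 \\ t_3^3 & t_3^2 & t_3 & 1/t_3^3 \\ t_4^3 & t_4^2 & t_4 & 1/t_4^3 \end{vmatrix}$$

is $\Theta(t_4^3 t_3^2 t_2 t_0^{-3})$.

Since $F = \max_{i = 0, 2, 3, 4} |F_i|$, we know that the numerator satisfies

$$\begin{vmatrix} F_0 & t_0^2 & t_0 & 1/t_0^3 \\ F_2 & t_2^2 & t_2 & 1/t_2^3 \\ F_3 & t_3^2 & t_3 & 1/t_3^3 \\ F_4 & t_4^2 & t_4 & 1/t_4^3 \end{vmatrix} = O(F t_4^2 t_3 t_0^{-3}).$$

Then we have $C = O\left( \frac{F}{t_4 t_3 t_2} \right)$.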
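As an illustration of the Cramer's rule step, the sketch below solves a system of the same shape as (10.25) to (10.28) exactly with rational arithmetic. The helper names `det` and `solve_cramer` are hypothetical, and the particular values of $t_i$ and of the unknowns are chosen only for the example.

```python
from fractions import Fraction


def det(m):
    """Determinant by Laplace expansion along the first row (adequate for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def solve_cramer(ts, fs):
    """Solve F_i = t_i^3 C + t_i^2 D + t_i E + theta / t_i^3 for (C, D, E, theta)."""
    rows = [[t**3, t**2, t, Fraction(1, t**3)] for t in ts]
    d = Fraction(det(rows))
    solution = []
    for col in range(4):
        # Replace column `col` of the coefficient matrix by the right-hand side.
        replaced = [row[:col] + [f] + row[col + 1:] for row, f in zip(rows, fs)]
        solution.append(det(replaced) / d)
    return solution


# Four distinct powers of n, playing the roles of t_0 < t_2 < t_3 < t_4.
n = 10
ts = [n**1, n**3, n**4, n**5]
C, D, E, theta = 3, -2, 5, 7
fs = [C * t**3 + D * t**2 + E * t + Fraction(theta, t**3) for t in ts]
print(solve_cramer(ts, fs) == [C, D, E, theta])
```

Exact `Fraction` arithmetic is used because the entries span many orders of magnitude, which is also the reason the asymptotic bounds in the text are dominated by a single term of each determinant.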


A similar analysis shows that

$$D = O\left( \frac{F}{t_3 t_2} \right); \qquad E = O\left( \frac{F}{t_2} \right); \qquad \theta = O(F t_0^3).$$

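Therefore, we have

$$G \le |C| t_1^3 + t_1^2 |D| + t_1 |E| + |\theta|/t_1^3 \le O\left( F \left( \frac{t_1^3}{t_2 t_3 t_4} + \frac{t_1^2}{t_2 t_3} + \frac{t_1}{t_2} + \frac{t_0^3}{t_1^3} \right) \right).$$

Then notice that $t_{i+1}/t_i \ge n$, as they are different powers of $n$, so we have

$$\frac{G}{F} = O\left( \frac{t_1^3}{t_2 t_3 t_4} + \frac{t_1^2}{t_2 t_3} + \frac{t_1}{t_2} + \frac{t_0^3}{t_1^3} \right) \le O\left( \frac{1}{n} \right).$$

This contradicts (10.29). □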
10.3.2  Hardness Reduction from Label Cover

Recall that our reduction is from the Label-Cover instance $\mathcal{L}$ specified by $(U, V, E, k, m, \Pi)$. For notational convenience, let us use $F(q) : U \cup V \to \mathbb{N}$ to denote the number of possible labels for vertex $q$; i.e., for $u \in U$, $F(u) = k$ and for $v \in V$, $F(v) = m$.

The examples in the learning problem we reduce to lie in the space $\mathbb{R}^{|U|k + |V|m}$ and are labeled with either positive ($+1$) or negative ($-1$). Denote $\dim = |U|k + |V|m$. For $y \in \mathbb{R}^{\dim}$, each coordinate is indexed by a possible label for a vertex in $U \cup V$. We fix the following notation: for $q \in U \cup V$, we use $y_q^{(i)}$ to denote the coordinate corresponding to the vertex $q$'s $i$-th label ($i \in [F(q)]$). We use the vector $y_q$ to denote all the coordinates of $y$ corresponding to vertex $q$'s labels.

Following is the reduction; briefly speaking, we want to conduct the Dictator Test $\mathcal{T}_2$ on the restriction $p_v(x)$ for $v \in V$. For the given $(U, V, E, k, m, \Pi)$, choose the parameters $\beta = \frac{1}{\log m}$ and $\delta$ as in the Dictator Test with $n = m$.


Reduction from LABEL-COVER $\mathcal{L}$

1. Randomly pick a vertex $v \in V$.

2. For each $w \in U \cup V$ with $w \ne v$, set $y_w = 0$.

3. Generate independent $\beta$-biased bits $a_1, a_2, \ldots, a_m \in \{0, 1\}$ (i.e., $a_i = 1$ with probability $\beta$ and $0$ with probability $1 - \beta$).

4. Generate $m$ independent Gaussian variables $g_1, \ldots, g_m$.

5. Generate $t$ by picking a number $i \in \{1, 2, \ldots, (\log m)^2\}$ uniformly at random, then set $t = m^i$.

6. Generate a random bit $b \in \{-1, 1\}$.

7. Set $r = (a_1 g_1, a_2 g_2, \ldots, a_m g_m)$.

8. With $a \in \mathbb{R}^m$ the all-ones vector $(1, 1, 1, \ldots, 1)$, set $y_v = t^3 r + b t^2 \delta a$.

9. Output the example-label pair $(y, b)$ (with folding steps specified later).
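The coordinate layout of the examples can be made concrete with a short sketch of the reduction before folding. Everything named here (`coord_index`, `sample_example`, the vertex names in the usage note) is hypothetical; $\beta = 1/\log m$ with the natural logarithm is an assumption, $\delta$ is an input, and labels are indexed from 0 for convenience.

```python
import math
import random


def coord_index(U, V, k, m):
    """Map each (vertex, label) pair to a position in R^dim, dim = |U|k + |V|m."""
    index = {}
    pos = 0
    for u in U:
        for i in range(k):
            index[(u, i)] = pos
            pos += 1
    for v in V:
        for i in range(m):
            index[(v, i)] = pos
            pos += 1
    return index, pos


def sample_example(U, V, k, m, delta, rng):
    """Draw one (y, b) pair; also return the picked vertex v (assumes m >= 3)."""
    index, dim = coord_index(U, V, k, m)
    beta = 1.0 / math.log(m)
    v = rng.choice(V)                                          # step 1
    y = [0.0] * dim                                            # step 2: all other blocks stay 0
    a = [1 if rng.random() < beta else 0 for _ in range(m)]    # step 3
    g = [rng.gauss(0.0, 1.0) for _ in range(m)]                # step 4
    t = m ** rng.randint(1, int(math.log(m) ** 2))             # step 5
    b = rng.choice([-1, 1])                                    # step 6
    for i in range(m):                                         # steps 7 and 8
        y[index[(v, i)]] = t**3 * a[i] * g[i] + b * t**2 * delta
    return y, b, v
```

For instance, `sample_example(["u1", "u2"], ["v1", "v2"], 2, 8, 1e-3, random.Random(0))` returns a vector of length 20 whose only nonzero entries lie in the block of the picked vertex.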

The learning problem is to find a degree 2 polynomial $p : \mathbb{R}^{\dim} \to \mathbb{R}$ such that $\mathrm{sgn}(p(y)) = b$ for as many example-label pairs as possible. Let us denote

$$p(y) = \theta + \sum_{q \in U \cup V,\ i \in [F(q)]} c_q^{(i)} y_q^{(i)} + \sum_{q_1, q_2 \in U \cup V,\ i \in [F(q_1)],\ j \in [F(q_2)]} c_{q_1, q_2}^{(i,j)} y_{q_1}^{(i)} y_{q_2}^{(j)}.$$

Notice that in the reduction, when vertex $v$ is picked, we set all the coordinates to zero except $y_v$. Essentially, we are conducting the test $\mathcal{T}_2$ on the function

$$p_v = \theta + \sum_{i \in [m]} c_v^{(i)} y_v^{(i)} + \sum_{i, j \in [m]} c_{v,v}^{(i,j)} y_v^{(i)} y_v^{(j)},$$

which is the restriction of $p(y)$ obtained by setting all the coordinates to zero except those corresponding to vertex $v$. The fraction of agreement of $p(y)$ on all the examples is the average passing probability of $p_v$ (over $v \in V$) on the test $\mathcal{T}_2$.

Folding Trick: We use a “folding” technique similar to [60, 106]. The procedure is as follows: instead of outputting the pair $(y, b)$ in the last step of the above reduction, we output $(y', b)$, where $y'$ is the projection of $y$ onto some subspace $H^{\perp}$ (defined later). By folding, we are able to force $p(y)$ to have the same value on different points in $\mathbb{R}^{\dim}$ as long as their projections on $H^{\perp}$ are the same. It is easy to see that the projection can be done in polynomial time.

We define the subspaces $H$ and $H^{\perp}$ for our folding as follows:

Definition 10.3.4. For every $e = (u, v) \in E$ and $i \in [k]$, $b(e, i) \in \mathbb{R}^{\dim}$ is the vector with 0 at every coordinate except that $b(e, i)_u^{(i)} = 1$ and, for every $j \in (\pi_e)^{-1}(i)$, $b(e, i)_v^{(j)} = -1$. Let $B$ be the collection of all such $b(e, i)$: $B = \{ b(e, i) \mid e = (u, v) \in E,\ i \in [k] \}$. Define $H = \mathrm{span}(B)$ and $H^{\perp}$ to be the orthogonal complement of $H$ in $\mathbb{R}^{\dim}$.

After folding, we can further force $p(x)$ to have the following “folding” property:

For any $h \in B$ and $c \in \mathbb{R}$, $p(x + ch) = p(x)$.

We call a function that has the above property folded. In particular, for $e = (u, v) \in E$ and $i \in [k]$, $p(x + c\, b(e, i)) = p(x)$. If we view $p(y)$ as a polynomial only on $y_u^{(i)}$ and $y_v^{(j)}$ for $j \in (\pi_e)^{-1}(i)$ and apply Lemma 10.5.1, we have that

$$c_u^{(i)} = \sum_{j \in \pi_e^{-1}(i)} c_v^{(j)}.$$

If we sum over all possible $i$, this implies that for any edge $(u, v)$,

$$\sum_{i \in [k]} c_u^{(i)} = \sum_{i \in [m]} c_v^{(i)}.$$

Now we prove our main result, Theorem 10.1.4. Recall the hardness result of Label-COVER as follows [128]:

Theorem 2.5.2 There exists some constant $\eta$ such that it is NP-hard to distinguish the following two cases:

•  $\mathrm{Opt}(\mathcal{L}) = 1$;

•  $\mathrm{Opt}(\mathcal{L}) \le 1/m^{\eta}$.

We will show the following two properties of the reduction to complete the proof.

Theorem 10.3.5. (Completeness) If $\mathrm{Opt}(\mathcal{L}) = 1$, there is a folded function $p(x)$ that is consistent with a $1 - 1/\log(m)$ fraction of the points.

Theorem 10.3.6. (Soundness) If $\mathrm{Opt}(\mathcal{L}) \le 1/m^{\eta}$, there is no folded degree 2 polynomial
function  consistent  with  |  +  lo^m)  fraction  of  the  data. 

Combining Theorems 2.5.2, 10.3.5 and 10.3.6, and noticing that $m$ can be an arbitrarily large number (e.g., $e^{1/\epsilon}$), we easily get Theorem 10.1.4 (we also use a discretization argument similar to the one in Section 10.2.3).

Following are the proofs of Theorems 10.3.5 and 10.3.6.

Proof of Theorem 10.3.5

Proof. If $\mathrm{Opt}(\mathcal{L}) = 1$, suppose there is a labeling $l$ satisfying all the edges. Then consider the function

$$p(y) = \sum_{w \in U \cup V} y_w^{(l(w))}.$$

Notice that for every $v \in V$, $p_v$ is a dictator and passes $\mathcal{T}_2$ with probability at least $1 - \beta = 1 - 1/\log(m)$ by Lemma 10.3.1. Therefore $p$ passes with probability at least $1 - 1/\log(m)$.

It is also easy to check that $p(x)$ is folded. □
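The claim that this $p$ is folded can be checked mechanically on a toy instance. The sketch below is illustrative only: the helpers `build_index`, `folding_vectors` and `dictator_sum` are hypothetical, labels are indexed from 0, and the projection $\pi_e$ is stored as a list mapping each label of $v$ to a label of $u$.

```python
def build_index(U, V, k, m):
    """Assign a coordinate position to every (vertex, label) pair."""
    index, pos = {}, 0
    for q, size in [(u, k) for u in U] + [(v, m) for v in V]:
        for i in range(size):
            index[(q, i)] = pos
            pos += 1
    return index, pos


def folding_vectors(pi, k, index, dim):
    """Build b(e, i): +1 at (u, i) and -1 at (v, j) for every j with pi[e][j] == i."""
    vectors = []
    for (u, v), proj in pi.items():
        for i in range(k):
            b = [0] * dim
            b[index[(u, i)]] = 1
            for j, image in enumerate(proj):
                if image == i:
                    b[index[(v, j)]] = -1
            vectors.append(b)
    return vectors


def dictator_sum(labeling, index):
    """p(y) = sum over vertices w of the coordinate y_w^{(l(w))}."""
    picked = [index[(w, label)] for w, label in labeling.items()]
    return lambda y: sum(y[pos] for pos in picked)


# Toy instance: one vertex in U with k = 2 labels, two vertices in V with m = 4 labels.
U, V, k, m = ["u"], ["v1", "v2"], 2, 4
pi = {("u", "v1"): [0, 0, 1, 1], ("u", "v2"): [0, 1, 0, 1]}
labeling = {"u": 1, "v1": 2, "v2": 3}     # satisfies both edges: pi maps 2 -> 1 and 3 -> 1
index, dim = build_index(U, V, k, m)
p = dictator_sum(labeling, index)
x = list(range(1, dim + 1))
folded = all(
    p([xi + c * bi for xi, bi in zip(x, b)]) == p(x)
    for b in folding_vectors(pi, k, index, dim)
    for c in (1, -3, 5)
)
print(folded)
```

The check succeeds because, along each $b(e, i)$, the $+1$ at the chosen label of $u$ is cancelled by the $-1$ at the chosen label of $v$ exactly when the labeling satisfies the edge $e$.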


Proof  of  Theorem  10.3.6 

Proof.  The  proof  is  by  contradiction.  Suppose  there  is  some  folded  degree  2  polynomial 
p(x)  such  that  sgn(p(x))  agrees  with  more  than  |  fraction  of  the  example,  i.e.,  the 

averaging  passing  probability  of  pv  on  3~m  is  |  +  By  an  averaging  argument,  we  know 

for  fraction  of  the  v  eU,  pv  passes  the  test  3\  with  probability  \  and  we  call 

such  a  v  “good”  vertex.  Also  we  call  an  edge  “good”  if  one  of  the  endpoint  of  the  edge  is  a 
good  vertex.  By  the  regularity  of  the  graph,  we  know  at  least  fraction  of  the  edges  are 
“good”. 

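For a “good” vertex $v$, define

$$I_v = \left\{ i \in [m] : c_v^{(i)} > \frac{1}{m^2} \sum_{i'=1}^{m} c_v^{(i')} \right\};$$

then by Lemma 10.3.2, $|I_v| \le (\log m)^2$. For every $u \in U$, we define $J_u = \{ j \in [k] : c_u^{(j)} \ge \sum_{i \in [k]} c_u^{(i)} / k \}$.

Notice that $J_u$ is not empty, as

$$\max_j c_u^{(j)} \ge \sum_{i \in [k]} c_u^{(i)} / k.$$

We define the following labeling strategy for $\mathcal{L}$. For $u \in U$, randomly assign it a label from $J_u$; for $v \in V$, randomly assign it a label from $I_v$ (if $I_v$ is empty, just assign any label).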
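The labeling strategy can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `I_set`, `J_set` and `random_labels` are hypothetical, the coefficient vectors are passed in as plain lists of the linear coefficients, labels are indexed from 0, and the thresholds are compared after clearing denominators so that integer or rational inputs are handled exactly.

```python
import random


def I_set(c_v):
    """Labels i of v whose linear coefficient exceeds (sum of coefficients) / m^2."""
    m = len(c_v)
    total = sum(c_v)
    return [i for i, ci in enumerate(c_v) if ci * m**2 > total]


def J_set(c_u):
    """Labels j of u whose linear coefficient is at least the average coefficient."""
    k = len(c_u)
    total = sum(c_u)
    return [j for j, cj in enumerate(c_u) if cj * k >= total]


def random_labels(coeffs_U, coeffs_V, rng):
    """Pick a label from J_u for each u and from I_v for each v (label 0 if I_v is empty)."""
    labels = {}
    for u, c in coeffs_U.items():
        labels[u] = rng.choice(J_set(c))      # J_u is never empty: the maximum is >= the mean
    for v, c in coeffs_V.items():
        candidates = I_set(c)
        labels[v] = rng.choice(candidates) if candidates else 0
    return labels
```

For example, `J_set([3, -1, 2])` keeps the labels 0 and 2, and `I_set([5, 0, 1, 0])` keeps the labels 0 and 2 as well.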

For every good edge $e = (u, v)$ and any $j \in J_u$, by folding, we have

$$\sum_{i \in \pi_e^{-1}(j)} c_v^{(i)} = c_u^{(j)} \ge \sum_{i \in [k]} c_u^{(i)} / k = \sum_{i \in [m]} c_v^{(i)} / k.$$

There is at least one label $i$ in $\pi_e^{-1}(j)$ such that $c_v^{(i)} \ge \sum_{i' \in [m]} c_v^{(i')} / (km) \ge \sum_{i' \in [m]} c_v^{(i')} / m^2$, and it is therefore in $I_v$. Notice that $|I_v| \le (\log m)^2$; by our randomized labeling strategy, we have at least a $1/(\log m)^2$ chance to satisfy the edge $(u, v)$.

Therefore the above labelling strategy satisfies (in expectation) at least a $1/(\log m)^2$ fraction of the good edges and a $1/(\log m)^3$ fraction of all the edges. For large enough $m$, this contradicts the fact that $\mathrm{Opt}(\mathcal{L}) \le 1/m^{\eta}$. □
10.4  Probability Inequalities


The following fact is from [27].

Lemma 10.4.1. Let $f(x) : \mathbb{R}^n \to \mathbb{R}$ be a degree-$d$ polynomial function. Then for $x_1, x_2, \ldots, x_n$ independent standard Gaussians,

$$\Pr[|f(x_1, \ldots, x_n)| \le \alpha] \le O\left( d \cdot \mathbf{E}[f(x_1, \ldots, x_n)^2]^{-\frac{1}{2d}} \alpha^{1/d} \right).$$

Lemma  10.4.2.  For  a  degree  d  polynomial  f  =  Lmulti8etS,\SKdjSs\»n]  CS  flies  %i  and  for  any 


multset  T  c  [n\,  \\f\\2  =  F,[fix\,...,xn)2]^  ^ 


\f(T)\ 

rdd)dd • 
2 


Proof.  Given  f,  one  way  to  calculate  its  II/2II2  is  to  convert  it  into  its  hermite  expansion 

A  A  Q 

£g/(S)xs  where  %S  is  the  Hermite  polynomial.  Then  the  variance  is  J^fiS)  . 

of  the  form  j,g(x)  =  Lr&s  hg  n ieTxi  where  hg  is  the  coefficients  of  the  Hermite  polyno¬ 
mial  hs- 


We  know  that  ct  can  be  written  as  the  sum  of  'Lr^s^gfiS).  There  are  at  most  ( n +d  ) 
term  in  the  summation. . . ,  Also  every  is  some  constant  only  depending  on  d  actually 
it  is  not  hard  to  bound  it  by  dd.  Therefore,  there  must  be  at  least  one  hermite  coefficients 


f(S)  that  has  absolute  value  bigger  than 


\f(T)\ 

ddnd ) 


and  this  give  a  lower  bound  for  || /* || 2-  □ 


10.5  Folding  Lemma 


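Lemma 10.5.1. Suppose

$$p(x) = \theta + \sum_{i=0}^{n} w_i x_i + \sum_{0 \le i \le j \le n} w_{ij} x_i x_j$$

is a degree 2 function. If $p(x + c(1, -1, \ldots, -1)) = p(x)$ for any $c \in \mathbb{R}$, then $w_0 = \sum_{i=1}^{n} w_i$.

Proof. We know that

\begin{align*}
&\theta + w_0 (x_0 + c) + \sum_{i=1}^{n} w_i (x_i - c) + w_{00} (x_0 + c)^2 + \sum_{j=1}^{n} w_{0j} (x_0 + c)(x_j - c) + \sum_{1 \le i \le j \le n} w_{ij} (x_i - c)(x_j - c) \\
&\quad = \theta + \sum_{i=0}^{n} w_i x_i + \sum_{0 \le i \le j \le n} w_{ij} x_i x_j.
\end{align*}

Notice that the above equation holds for any $c, x$. Therefore, if we express the left hand side and the right hand side as polynomials in the variables $c, x_0, x_1, \ldots, x_n$, the corresponding coefficients should be the same. If we look at the coefficient of the term $c$, we have

$$w_0 - \sum_{i=1}^{n} w_i = 0.$$

□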
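The coefficient comparison in the proof can be illustrated numerically. In the sketch below (helper names `quad_poly` and `coefficient_of_c` are hypothetical), the coefficient of $c$ in $p(c \cdot v)$ for $v = (1, -1, \ldots, -1)$ is read off as $(p(v) - p(-v))/2$, since the constant and quadratic parts cancel in that difference.

```python
from fractions import Fraction
import random


def quad_poly(theta, w, W):
    """p(x) = theta + sum_i w[i] x_i + sum_{i<=j} W[i][j] x_i x_j, with indices 0..n."""
    def p(x):
        size = len(x)
        linear = sum(w[i] * x[i] for i in range(size))
        quadratic = sum(W[i][j] * x[i] * x[j] for i in range(size) for j in range(i, size))
        return theta + linear + quadratic
    return p


def coefficient_of_c(p, n):
    """Coefficient of c in p(c * v), v = (1, -1, ..., -1), computed as (p(v) - p(-v)) / 2."""
    v = [1] + [-1] * n
    minus_v = [-vi for vi in v]
    return Fraction(p(v) - p(minus_v), 2)


# For an arbitrary polynomial the coefficient equals w_0 - (w_1 + ... + w_n).
rng = random.Random(2)
w = [rng.randint(-5, 5) for _ in range(4)]
W = [[rng.randint(-5, 5) for _ in range(4)] for _ in range(4)]
print(coefficient_of_c(quad_poly(3, w, W), 3) == w[0] - sum(w[1:]))

# A polynomial invariant along v: 2(x0 + x1) + 3(x0 + x2) + (x0 + x1)(x0 + x2).
q = quad_poly(0, [5, 2, 3], [[1, 1, 1], [0, 0, 1], [0, 0, 0]])
print(coefficient_of_c(q, 2) == 0)
```

In the second example the linear coefficients are $w_0 = 5$, $w_1 = 2$, $w_2 = 3$, so the invariance along $v$ goes together with $w_0 = w_1 + w_2$, as the lemma states.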
10.6  Discretization of the Gaussian distribution


Theorem 10.6.1. There is a probability distribution on $(\mathcal{G}, \mathcal{H}_N) \in \mathbb{R}^2$ such that the marginal distribution on $\mathcal{G}$ follows the standard Gaussian distribution and the marginal distribution of $\mathcal{H}_N$ follows the distribution of the sum of $N$ random bits; i.e., $\mathcal{H}_N = \sum_{i=1}^{N} b_i$, where the $b_i$ are independent random bits from $\{-1, 1\}$. In addition, $\mathcal{H}_N/\sqrt{N}$ and $\mathcal{G}$ are close in the following sense: with probability at least $1 - O(N^{-1/4})$, $|\mathcal{G} - \mathcal{H}_N/\sqrt{N}| \le O(N^{-1/4})$.

Proof. Let $\Phi$ be the CDF (cumulative distribution function) of $\mathcal{H}_N$ and $\Psi$ be the CDF of the marginal distribution on $\mathcal{G}$ (i.e., the standard Gaussian distribution).

We can couple the random variables in the following way: first we sample $h_0$ from the marginal distribution on $\mathcal{H}_N$. We know that

$$\Pr(\mathcal{H}_N = h_0) = \Phi(h_0) - \Phi(h_0 - 2),$$

since if $h_0$ is a feasible outcome of summing $N$ bits, then $h_0 - 2$ is the biggest feasible outcome less than $h_0$ (if there is any). Then we generate $\mathcal{G}$ by drawing random samples from the Gaussian distribution until a sample falls into the interval $(\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]$, and we set $\mathcal{G}$ to that sample.

By the above construction, we claim that $\mathcal{G}$ follows the Gaussian distribution: essentially, we use the value of $h_0$ as an indicator of whether $\mathcal{G}$ is in the interval $(\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]$. We also need to check that $\Pr(\mathcal{H}_N = h_0) = \Pr(\mathcal{G} \in (\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))])$. This is true because

\begin{align*}
\Pr(\mathcal{H}_N = h_0) &= \Pr(\mathcal{H}_N \in (h_0 - 2, h_0]) = \Phi(h_0) - \Phi(h_0 - 2) \\
&= \Pr(\mathcal{G} \in (\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]).
\end{align*}

By the above coupling of $\mathcal{G}$ and $\mathcal{H}_N$, it remains to prove that any number in the interval $(\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]$ is close to $h_0/\sqrt{N}$ for most of the $h_0$ generated.
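The interval used in the coupling can be computed explicitly for small $N$. The sketch below is illustrative: the helpers `binom_sum_cdf` and `gaussian_interval` are hypothetical names, the binomial CDF is computed exactly with `math.comb`, and the Gaussian quantile function comes from the standard library class `statistics.NormalDist`. The value $h_0$ is assumed to have the same parity as $N$, so that it is a feasible outcome.

```python
import math
from statistics import NormalDist


def binom_sum_cdf(N, h):
    """Pr[H_N <= h], where H_N is a sum of N independent uniform +/-1 bits."""
    if h < -N:
        return 0.0
    # H_N = 2 * (number of +1 bits) - N, so H_N <= h iff the number of +1 bits
    # is at most (h + N) // 2.
    ones_max = min(N, (h + N) // 2)
    return sum(math.comb(N, j) for j in range(ones_max + 1)) / 2**N


def gaussian_interval(N, h0):
    """Endpoints of the interval in which the Gaussian sample is placed when H_N = h0."""
    std = NormalDist()

    def quantile(p):
        if p <= 0.0:
            return -math.inf
        if p >= 1.0:
            return math.inf
        return std.inv_cdf(p)

    return quantile(binom_sum_cdf(N, h0 - 2)), quantile(binom_sum_cdf(N, h0))


lo, hi = gaussian_interval(100, 0)
mass = NormalDist().cdf(hi) - NormalDist().cdf(lo)
print(lo, hi, mass)
```

By construction, the Gaussian mass of the interval equals $\Pr(\mathcal{H}_N = h_0)$, and for $h_0$ near the center both endpoints are close to $h_0/\sqrt{N}$.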

It suffices to check the following two inequalities:

•  W-HdXho))-^^^; 

•  |T-1(®(/io-2))-^|K^r. 

VN 

Suppose  ho  >  0  and  we  will  just  prove  the  second  one.  By  the  Berry  Esseen  Theorem,  we 
know  that  |®(/io  +  2)  -  T,(^=y)l  ^  ^=. 

Therefore, 


+  2))  ^  T^TT^-^)  +  — - )  ^  ^3-  +  0(  (10.30) 

VN  VN  VN  g-(s+2)2/2 

Notice  that  when  ho  ^  3^3’  (10-30)  is  bounded  by  0(^74).  Also  by  Chernoff  Bound,  we 
know  that  when  Pr(J5?  >  \333~)  ^  O(-lj-). 

V  ^  AT  A 
Therefore, except with probability $O(N^{-1/4})$, $|\mathcal{G} - \mathcal{H}_N/\sqrt{N}| \le O(N^{-1/4})$.

□


By the above theorem, we can construct a distribution $\mathcal{H}_N/\sqrt{N}$ that is point-wise close to the Gaussian distribution $\mathcal{G}$ with high probability. Now we will use the constructed distribution to discretize the high dimensional Gaussian space for low degree PTFs.

Theorem 10.6.2. Let $f(x_1, \ldots, x_n) = \sum_{|S| \le D} \hat{f}(S) \prod_{i \in S} x_i$ be any degree $D$ polynomial, where $D$ is some constant that does not depend on $n$. Let $(y, z) \in \mathbb{R}^n \times \mathbb{R}^n$ be generated by sampling $n$ times i.i.d. from the distribution $(\mathcal{G}, \mathcal{H}_N/\sqrt{N})$ in the setup of Theorem 10.6.1, where we take $N = n^{24D}$.

Then

$$\Pr(\mathrm{sgn}(f(y)) \ne \mathrm{sgn}(f(z))) \le O\left( \frac{1}{n} \right).$$

Proof.  Without  loss  of  generality,  let  us  assume  that  l/(S)|  =  1.  By  Lemma  10.4.2,  we 
know  that  11/211  ^J)Dd  • 

By  union  bound  and  Theorem  10.6.1,  we  know  that  with  probability  1  -  =  1  -  O(^), 

we  have  that  for  every  i  e  [n\,  \xi  -yf  ^  ^74 

Similar  to  the  calculation  in  (10.3),  when  y  and  2  are  close  on  each  coordinate,  we  have 
that 

\f(y)  ~  fiz)\  ^  -\oin2L>2)  ^  ^ 

ATi  n6L> 

Then 


Pr(sgn(/(y))  /  sgn(/(z))  ^  Pr(/(y)  ^  \f(z)  -  f(y) |) 


^0(-)  +  Pr(/(yK 

n 


1 

~n^ 


)  (10.31) 


By  Lemma  10.4.1,  we  can  bound 


byO(i). 


Pr(/(y)  < 


n 


Overall  we  bound  the  probability  of  Pr(sgn(/(y))  /  sgn(/(2))  by  O(^). 


□ 


Remark  10.6.3.  Above  theorem  immediately  implies  that  can  be  used  to  fool  low 

degree  PTFs  over  cS®n.  Also  the  distribution  of  ,  by  definition,  can  be  generated  with 

n24D2  +  l  random  JjHs 
Part IV

Open Problems

Chapter 11
Open Problems

Numerous problems remain unsolved in understanding the approximability of NP-hard problems. One of the most important is to prove or disprove the Unique Games Conjecture as well as the d-to-1 conjecture. In addition, I list the open problems that I found intriguing, and hard to solve, during the writing of my thesis.

Efficient SDP Rounding for CSPs: In the study of Max Cut, we gave an SDP rounding algorithm with running time $\mathrm{poly}(n) \cdot 2^{\mathrm{poly}(1/\epsilon)}$. Can we improve its running time to $\mathrm{poly}(n) \cdot \mathrm{poly}(1/\epsilon)$? Can we even give an efficient rounding algorithm for general CSPs? In [125], the author gave a generic SDP rounding algorithm for almost every CSP with running time $\mathrm{poly}(n) \cdot 2^{2^{\mathrm{poly}(1/\epsilon)}}$. Can we improve this running time to make the algorithm more practical?

NP-hardness for Satisfiable 3-CSP: Can we prove that Max 3-CSP $(1, 5/8 + \epsilon)$ is NP-hard without assuming the d-to-1 conjecture?

Hardness of Approximating Satisfiable CSPs: Can we establish a more general result on the approximability of satisfiable CSPs? In particular, can we prove or disprove the following conjecture:

Conjecture 11.0.4. Let $\Phi$ be a predicate set. For the problem of Max $\Phi$, $\mathrm{Gap}_{\mathrm{Test}}(1)$, which is the optimal soundness of the Dictator Test using predicates from the set $\Phi$ with perfect completeness, is equal to the optimal approximation ratio for Max $\Phi$ when the instance is satisfiable.

SDP gap for 2-to-1 Label-Cover: Can we construct instances of 2-to-1 Label-Cover with SDP value 1 and optimum value $1/R^{\Theta(1)}$, where $R$ is the alphabet size? More desirably, can we obtain such a gap under an even stronger form of SDP for 2-to-1 Label-Cover?

NP hardness results of MA-MON-PTF$_d$ $(1 - \epsilon, 1/2 + \epsilon)$: Can we show that even if there exists a monomial that is consistent with a 0.99 fraction of the data, it is hard to find a low degree PTF that is consistent with a 0.51 fraction of the examples? Such a hardness result would subsume almost all the previous results on hardness of agnostic learning and strengthen the belief that learning tasks under agnostic noise over arbitrary distributions are essentially hard.